Re: spinlock on Alpha ES40

From: Andrew Pochinsky (avp@honti.mit.edu)
Date: Sat Jun 24 2000 - 17:09:14 EST


   Sender: frival@wavy.zk3.dec.com
   Date: Thu, 22 Jun 2000 11:31:43 +0000
   From: Peter Rival <frival@zk3.dec.com>

   Without getting into a flame-fest about just how much locking is enough versus how much is too much (way too much intersection on those
   points...) let's just say that this particular situation is _much_ better in the 2.3/2.4 series. I've got an ES40 with 60+ disks and 4 GB of
   memory and I don't see messages like this too much until I _really_ load the system with the latest kernels.

No intention to start a flame war. I just thought that the spinlock
messages might have something to do with the lockup. After I loaded
2.2.16 on a couple of the machines, there seem to be a lot _fewer_
spinlock messages. I'll have to wait and see whether it eventually
locks up. (BTW, running on a 164 uniprocessor does not cause this
sort of problem -- one of those machines has been up for more than
two months.)

   <snip>

> Sometimes, the machine goes completely catatonic and has to be
> reset. Less often the system really crashes. My estimate is that
> this lockup happens once in about a fortnight; across the 10 machines
> we are running, that works out to somewhat less than one failure per
> day.
>

   _This_ is not good. Do these hangs have a "stuck" line with no "grabbed" line afterwards? Other than some serious problems with the QLogic
   driver (and apparently only when attached to a RAID array...hrmmm....) I haven't been able to take my system down with the latest 2.3/2.4
   kernels - haven't really tried 2.2 since some of the 2.2.14pre series.

Yep. That's what happened the other day with the stock 2.2.14 from
the Red Hat 6.2 distribution. The machine was completely unresponsive
(the console did process keyboard events, but getty seemed otherwise
engaged; I had to power-cycle the poor creature). The address in the
"spinlock stuck" messages points inside kernel/sched.c:schedule().
Interestingly, inline_pc and lock->previous are the same.
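
In case it helps, here is roughly how I understand the debug spinlock
to behave. This is a sketch from memory, not the actual arch/alpha
code -- the struct layout, the threshold, and the message format are
my guesses; the relevant idea is that a successful acquirer stashes
its PC in lock->previous:

/*
 * Sketch of a debug spinlock that records where the lock was last
 * grabbed.  Field names and the "stuck" message are assumptions,
 * not the real kernel code.
 */
#include <stdio.h>

struct debug_spinlock {
        volatile int lock;
        void *previous;         /* PC of the last successful acquirer */
};

#define STUCK_THRESHOLD (1L << 28)

void debug_spin_lock(struct debug_spinlock *l)
{
        void *inline_pc = __builtin_return_address(0);
        long tries = 0;

        while (__sync_lock_test_and_set(&l->lock, 1)) {
                if (++tries > STUCK_THRESHOLD) {
                        printf("spinlock stuck at %p, owner at %p\n",
                               inline_pc, l->previous);
                        tries = 0;
                }
        }
        l->previous = inline_pc;        /* we own it; remember where */
}

void debug_spin_unlock(struct debug_spinlock *l)
{
        __sync_lock_release(&l->lock);
}

If that reading is right, inline_pc == lock->previous means the lock
was last grabbed from the very place where we are now spinning --
i.e. schedule() contending with schedule(), presumably on the
runqueue lock.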

   If you can, try the latest 2.3/2.4 kernels. They're much more scalable and at least for me very stable. BTW, what are you running on these
   systems? I'm just curious if there's a load I can put on my system to replicate what you're doing... :)

I'll try to get 2.4 onto a couple of the machines, but it could take
some time. As for what we run on them: I asked one of the users for
the codes he was running; the binaries (with some of the sources, but
I'm afraid not all of them) are at
<ftp://www.lns.mit.edu/pub/avp/spin-lock-chant.tar.gz>.

It still might be that the spinlock messages and the lockup are two
completely separate problems, but assuming they are related, I'm
wondering whether something like srm_check
<http://www.openvms.digital.com/openvms/21264_considerations.htm> has
been ported to (or written for) Linux, and whether any of the kernels
were validated against it. (If not, I'd volunteer to write something
along these lines -- the description on DEC's page seems
straightforward; a rough sketch of what I mean is below.)
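
To make "something along these lines" concrete, below is the shape of
test I have in mind. It is not srm_check itself (I have not seen its
sources) -- just the classic message-passing litmus test with a
made-up iteration count, which ought to catch a missing memory
barrier on a 21264 SMP box once the mb()s are taken out:

/*
 * Message-passing litmus test sketch.  Alpha-only because of the
 * inline "mb"; compile with -pthread.  With both barriers in place
 * the reader must never see data == 0 after observing flag == 1;
 * remove them and an SMP system with a weak memory model may show
 * reorderings.
 */
#include <pthread.h>
#include <stdio.h>

#define mb()    asm volatile("mb" ::: "memory")

volatile long data, flag;
static long reorders;

static void *writer(void *arg)
{
        (void)arg;
        data = 1;
        mb();           /* make the data store visible before the flag */
        flag = 1;
        return NULL;
}

static void *reader(void *arg)
{
        (void)arg;
        while (!flag)
                ;       /* wait for the writer's flag */
        mb();           /* order the flag load before the data load */
        if (data == 0)
                reorders++;
        return NULL;
}

int main(void)
{
        long i;

        for (i = 0; i < 100000; i++) {
                pthread_t w, r;

                data = flag = 0;
                pthread_create(&r, NULL, reader, NULL);
                pthread_create(&w, NULL, writer, NULL);
                pthread_join(w, NULL);
                pthread_join(r, NULL);
        }
        printf("reorderings seen: %ld\n", reorders);
        return 0;
}

A real validator would presumably cycle through many such patterns
(store-store, load-load, and the dependent-load ordering that is
special on Alpha), but this is the general idea.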

Thanks,
--andrew



