Re: 2.0.33 kernel lockups

Robert G. Brown (rgb@phy.duke.edu)
Wed, 4 Mar 1998 15:43:29 -0500 (EST)

Next message: Keith Rowland: "2.0.29/ISS (Replacement for 2.0.33)"
Previous message: Larry M. Augustin: "mysterious 2.0.30+ hangs"

On Tue, 3 Mar 1998, Donald Becker wrote:

> The system is not merely "stacking interrupts", which would be OK. It's
> violating the semantics of an interrupt handler by calling the interrupt
> handler again while it's already handling an interrupt.
> - This only happens on a dual processor system, so it must have something
> to do with the SMP interrupt dispatch.
> - Reports have been consistent that all interrupts are being handled by
> processor #0, so my earlier theory that processor #1 was calling the
> interrupt handler was likely incorrect.

As I understand it (experts, please correct me as I'm dyin' out here)
all the hardware interrupts on an Intel system go to one processor (the
boot CPU, usually CPU 0 if I recall correctly). There is only a single
kernel lock. A kernel deadlock occurs if CPU 0 is handling an interrupt
(and hence holds the kernel lock) and requires a second interrupt to
complete. That interrupt can be accepted on CPU 1, which cannot obtain
the kernel lock from CPU 0, which in turn cannot give it up until the
requested interrupt completes. So, if the tulip_interrupt routine itself
required an interrupt to complete, one could get a deadlock. I didn't see
anyplace in the code where this was obviously happening, but you would
know if there was one.
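
The deadlock condition described above can be written down as a toy
model. This is only a sketch in userspace C, not kernel code: the
struct toy_kernel type and the functions can_take_lock() and
is_deadlocked() are hypothetical names invented for illustration, and
the model assumes that the CPU already holding the lock may re-enter it.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model (hypothetical, not kernel code): one global kernel lock. */
struct toy_kernel {
    int  lock_owner;            /* -1 = free, otherwise the CPU holding it */
    bool holder_waits_for_irq;  /* holder cannot release the lock until a
                                   second interrupt has been serviced */
};

/* Assumption of this model: a CPU can service an interrupt only if the
 * lock is free or that same CPU already holds it. */
static bool can_take_lock(const struct toy_kernel *k, int cpu)
{
    assert(cpu >= 0);
    return k->lock_owner == -1 || k->lock_owner == cpu;
}

/* Deadlock: the lock holder is waiting for an interrupt, and the CPU
 * that accepted that interrupt can never obtain the lock to service it. */
static bool is_deadlocked(const struct toy_kernel *k, int accepting_cpu)
{
    return k->holder_waits_for_irq && !can_take_lock(k, accepting_cpu);
}
```

With lock_owner = 0 and holder_waits_for_irq = true, calling
is_deadlocked(&k, 1) returns true in this model, which is the scenario
above: CPU 1 accepts the interrupt but can never get the lock.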

The solution (if in fact this is occurring) is to place an
"allow_interrupts" call before invoking the interrupt. There are a
bunch of comments in kernel/sched.c about the allow_interrupts() call
and one can look at its implementation in buffer.c and keyboard.c.

However, it doesn't >>look<< to me like this is what is happening.
Instead, I think that what is happening is that when a very high speed
datastream is incident on the network interface AND the tulip_interrupt
handler happens to be running on CPU 0, a lock is somehow not being set
in an SMP-reliable way and a second eth0 interrupt is accepted on CPU 1
and scheduled on CPU 0 before the first one returns. Instead of a
deadlock, one gets a nested, recursive corruption of the kernel.
Presumably this doesn't happen more than once in a >>very<< rare while
as long as the time required to process a networking interrupt is small
compared to the interpacket latency, which is why it is only just now
being revealed as systems capable of generating ~40Kpps are coming
online. It may be that when the original hooks were written (back in
the days of single processors running 10B only) there was no need for
locks since the interpacket latency was guaranteed to be long compared
to the time to process the interrupt...
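
For scale: at 40 Kpps the mean interpacket gap is 1/40000 s = 25
microseconds, so a handler that takes longer than that can still be
running when the next packet arrives. One common way to make such a
nested call at least detectable is an atomic test-and-set flag around
the handler body. The sketch below is a userspace illustration using
C11 atomics, not the tulip driver's code: struct toy_dev,
toy_dev_init() and toy_interrupt() are hypothetical names, and the
nested arrival is simulated by a direct recursive call.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical per-device state; not taken from any real driver. */
struct toy_dev {
    atomic_flag in_irq;  /* set while the handler body is running */
    int handled;         /* interrupts processed to completion */
    int rejected;        /* re-entrant calls detected and refused */
};

static void toy_dev_init(struct toy_dev *d)
{
    atomic_flag_clear(&d->in_irq);
    d->handled = 0;
    d->rejected = 0;
}

/* Returns true if the interrupt was handled, false if the call was a
 * re-entry. The counters are plain ints because this demonstration is
 * single-threaded; only the guard flag itself is atomic. */
static bool toy_interrupt(struct toy_dev *d, bool simulate_nested)
{
    /* test-and-set returns the previous value: true means some other
     * invocation is already inside the handler. */
    if (atomic_flag_test_and_set(&d->in_irq)) {
        d->rejected++;
        return false;
    }

    if (simulate_nested)
        toy_interrupt(d, false);  /* second interrupt arrives mid-handler */

    d->handled++;                 /* stands in for the real work */
    atomic_flag_clear(&d->in_irq);
    return true;
}
```

Calling toy_interrupt(&d, true) on a freshly initialised device handles
the outer interrupt and counts the nested one as rejected instead of
letting it run through the handler body a second time. Whether refusing
the nested call is safe for a real device (the second interrupt still
has to be serviced somehow) is a separate question this sketch does not
answer.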

In a moment I'm going to try hitting my system with an 80Kpps
stream (the two PII's) with a single CPU kernel running just to verify
that it is an SMP lock that is failing. If it is, then I'll try to find
the points where the locks are being set that "should" prevent this from
happening. I wish I knew more about the kernel (I do have Back and The
Kernel Hacker's Guide) but I suppose it is a great time to learn...

rgb

Robert G. Brown http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567 Fax: 919-660-2525 email:rgb@phy.duke.edu

