2.2.X lockups (summary attempt)

Brandon Black (brandon.black@wcom.com)
Tue, 05 Oct 1999 22:29:24 -0500


I'm not the best person to do this by a long shot, since I don't
understand Linux internals worth anything.... but somebody needs to
start this thread....

There have now been numerous reports of "strange lockup", "hang with no
error output", etc. on the lkml regarding versions 2.2.x (mostly regarding
2.2.11/12)... Here's what I think I gather from a cursory review of
them:

1. It is most likely a very well hidden kernel bug, as opposed to a
hardware issue. I base this on the fact that the hardware this has been
reported on has varied drastically in every way; there isn't much in the
way of common hardware between the cases.

2. That being said, it seems the majority of the reporters are running
fast CPUs of some brand or another (specifically, one of the most
recent reporters only saw the bug after upgrading from a 300 to a 400
CPU), so it might actually take a fair bit of CPU speed to trigger the
bug (don't ask me why.... is the kernel racing itself, and a higher
speed makes the hang more likely?)

3. It seems very elusive, almost to the point of randomness.... In my
case of this "hang", at times I have been able to reproduce it
repeatably... at other times I can't get it to come back under what
I believe to be identical conditions.... and certainly I never got it to
reproduce on any kernel w/ IKD patches (even with only the EIP printing
turned on).

4. I haven't seen (maybe I missed them?) any reports of this on SMP
boxes... is it possible that this has an effect? Has anyone tried
running an SMP kernel on their UP box to see if the hang goes away?
Could it be that some extra locking done in the SMP code prevents the
bug? It would at least be interesting to see if any SMP-ers can
produce this bug. (For my part, I have another machine nearly
identical to my hanging machine, which happens to be a dual processor
SMP... it is a firewall... it currently has an uptime of 39 days
w/ 2.2.12 w/o a single problem.)

5. At least a few of us have reported that the problem is triggerable
with intense disk activity (although intense disk activity isn't
necessary for it to happen randomly on its own), and at least one person
reports indications of memory corruption under heavy disk activity
(indicated by a kernel oops message which displayed corrupted, invalid
text on the screen).... all of this is dubious evidence at best, but
might be pointing in the right general direction....

Obviously a lot of the above is rough guessing.... if everyone who is
having these problems would send me personal email with your hardware
config, kernel config, and a list of kernel patches you were running
(hopefully none or ikd) at the time.. I'll try to do the legwork of
verifying what is or is not common between all of the cases, so that
hopefully we will be better informed about what might really be going on
here and where to look to try and track it down....
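
A minimal sketch of how that comparison could go, assuming each report
is boiled down to a set of "key=value" strings. The function names and
the sample reports are made up for illustration; they are placeholders,
not data from any actual report:

```python
from collections import Counter


def common_traits(reports):
    """Return the traits that appear in every single report."""
    if not reports:
        return set()
    return set.intersection(*(set(r) for r in reports))


def trait_frequency(reports):
    """Count how many reports mention each trait, most common first."""
    counts = Counter()
    for report in reports:
        counts.update(set(report))  # count each trait once per report
    return counts.most_common()


# Hypothetical placeholder reports, NOT taken from real submissions.
reports = [
    {"kernel=2.2.12", "smp=no", "disk=ide"},
    {"kernel=2.2.12", "smp=no", "disk=scsi"},
    {"kernel=2.2.11", "smp=no", "disk=ide"},
]

print(sorted(common_traits(reports)))
print(trait_frequency(reports)[0])
```

Traits shared by every report are the strongest leads; traits that are
merely frequent are weaker hints but probably still worth a look.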

Please mark the subject "2.2.X lockup" or something so I know what to
look for when scanning my (usually bloated) inbox....

Regards,
Brandon
