Re: 2.2.X lockups (summary attempt)

Nils Faerber (Nils.Faerber@unix-ag.org)
Thu, 7 Oct 1999 10:51:43 +0200


Hello!

> I'm not the best person to do this by a long shot, since I don't
> understand Linux internals worth anything.... but somebody needs to
> start this thread....
Indeed. I also think that we have a big problem here.

> 1. It is most likely a very well hidden kernel bug, as opposed to a
> hardware issue. I base this on the fact that the hardware this has been
> reported has varied drastically in every way, there isn't much in the
> way of common hardware between the cases.
It seems to happen more frequently on Celeron systems than any other. Maybe
because teh Celeron on die cache runs at CPU speed, i.e. is faster?

> 2. That being said, it seems the majority of the reporters are running
> fast CPU's of some brand or another (specifically, one of the most
> recent reporters only saw the bug after upgrading from a 300 to 400
> cpu), therefore it might actually take a fair bit of cpu speed in order
> to trigger the bug (don't ask me why.... is the kernel racing itself,
> and a higher speed makes the hang more likely?)
Right! The 'upgrader' was me. Sad but true yesterday kernel 2.2.13pre14 with
hedrick's ide patches for pre14 also froze :( The only difference so far it
seems was that the notebook that the kernel ran on was running on batteries
yesterday. Normally I have AC power connected and everything worked
perfectly well for more than three days with it.
Since it seems to be a timing problem I had some thoughts about this issue.
Doesn't the APM BIOS signal some kind of event when battery charging changes
are detected? Maybe those 'signals' have been the reason for some weird
timing changes and the freezes I experienced yesterday.

> 3. It seems very elusive, almost to the point of randomness....
I have the impression that right after the freeze my harddisk seems to
timeout on some started transfer because any time the machine froze I heard
a little 'click' (head movement) just a second after the freeze.
Interestingly the freezes _never_ happen during e2fsck on startup but at any
point after that.

> 4. I haven't seem (maybe I missed them?) any reports of this on SMP
> boxes... is it possible that this has an effect? Has anyone tried
> running an SMP kernel on their UP box to see if the hang goes away?
> Could it be that some extra locking done in the SMP code prevents the
> bug???? It would at least be an interesting thing to see if any SMP-ers
> can produce this bug. -- (for my part, I have another machine nearly
> identical to my hanging machine, which happens to be a dual processor
> SMP... it is a firewall... it has had an uptime (currently) of 39 days
> w/ 2.2.12 w/o a single problem.))
Some time ago I tried Win NT on the same machine and NT once dumped
bluescreen right after boot. The screen said something about an unhandled
interrupt 0x1f. So I browsed through the Linux kernel to see where 0x1f is
handled and found some code for initialization of higher interrupts for the
APIC used on SMP machines. So I compiled for SMP in the hope that this could
a cure but the follwing SMP kernel did also freeze some time later :(

> 5. At least a few of us have reported that the problem is triggerable
> with intense disk activity (although intense disk activity isn't
> necessary for it to happen randomly on its own), and at least one person
> reports indications of memory corruption under heavy disk activity
> (indicated bya kernel oops message which displayed corrupted invalid
> text on the screen).... all of this is dubious evidence at best, but
> might be pointing in the right general direction....
I have the impression that the freeze show up more frequently after some
time of constant hdd access; in my case I was playing audio files for some
time, i.e. the buffer cache is almost full.
And when then doing something completely different like changing a directory
using bash's TAB completion the freeze occurs. This happened several times
now.

> Regards,
> Brandon
CU
nils

-- 
Nils Faerber (Linux Nils)        eMail: nils@unix-ag.org
Student of computer science      http://www.si.unix-ag.org/~nils/
Unix user group, University of Siegen, Germany

Siegen ... the arctic rain forest!

--

- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.rutgers.edu Please read the FAQ at http://www.tux.org/lkml/