Is there any chance that it will ever work with X? I had such a lockup
(with interrupts enabled, since the machine was responding to ping and
accepting TCP connections but just sitting there afterwards) with 1.2.9.
Unfortunately, I was unable to give more details about the problem...
It would be much easier to find the bug if I could use Alt-ScrollLock
to display the EIP value, but this doesn't work when running X. Lockups
are very rare (this one was after 8 days uptime) and I can't afford not
to run X for such a long time... And what if it is triggered by X only?
This is possible, since such a lockup has never happened on a different
machine which is not running X. There were a few lockups in the past, but
they were different (no response to ping). They seem to be gone now (knock
on wood). Note that the network card here is not an NE2000 but a WD8003...

BTW, I have some ideas about the register dump feature. Since it is
explicitly requested by the user, why not always print it on the console
(like all the Oops messages) even if klogd is running? If something bad
happened, it may be difficult to log in and kill klogd...

Another idea - a hack, but it would make it possible to debug problems
like the above - set up a simple UDP service in kernel space, which
would just send the current EIP value (like the one displayed after
Alt-ScrollLock is pressed) when requested. This should work even if
no processes are running due to an infinite loop in the kernel (but
only if interrupts are enabled, of course). Is this possible?
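
To make the idea a bit more concrete, here is a rough user-space sketch of
the requesting side only. Everything in it is an assumption made for
illustration: the reply format (one 32-bit EIP value, most significant byte
first), the one-byte request, the two-second timeout, and the names
eip_reply_encode(), eip_reply_decode() and eip_query(). The kernel side,
which would have to answer from interrupt context, is not shown.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <sys/time.h>
#include <sys/types.h>
#include <unistd.h>

#define EIP_REPLY_LEN 4   /* hypothetical wire format: one 32-bit value */

/* Pack an EIP value into buf, most significant byte first.
 * Returns the number of bytes written, or 0 if buf is too small. */
size_t eip_reply_encode(uint32_t eip, unsigned char *buf, size_t buflen)
{
    assert(buf != NULL);
    if (buflen < EIP_REPLY_LEN)
        return 0;
    buf[0] = (unsigned char)((eip >> 24) & 0xff);
    buf[1] = (unsigned char)((eip >> 16) & 0xff);
    buf[2] = (unsigned char)((eip >> 8) & 0xff);
    buf[3] = (unsigned char)(eip & 0xff);
    return EIP_REPLY_LEN;
}

/* Unpack a reply. Returns 0 on success, -1 if the length is wrong. */
int eip_reply_decode(const unsigned char *buf, size_t len, uint32_t *eip)
{
    assert(buf != NULL && eip != NULL);
    if (len != EIP_REPLY_LEN)
        return -1;
    *eip = ((uint32_t)buf[0] << 24) | ((uint32_t)buf[1] << 16) |
           ((uint32_t)buf[2] << 8)  |  (uint32_t)buf[3];
    return 0;
}

/* Send a one-byte request to ip:port and wait up to two seconds for
 * the reply. Returns 0 and fills *eip on success, -1 on any failure. */
int eip_query(const char *ip, uint16_t port, uint32_t *eip)
{
    struct sockaddr_in addr;
    struct timeval tv = { 2, 0 };
    unsigned char buf[EIP_REPLY_LEN];
    unsigned char req = 0;
    ssize_t n;
    int fd, rc = -1;

    memset(&addr, 0, sizeof addr);
    addr.sin_family = AF_INET;
    addr.sin_port = htons(port);
    if (inet_pton(AF_INET, ip, &addr.sin_addr) != 1)
        return -1;                      /* not a dotted-quad address */

    fd = socket(AF_INET, SOCK_DGRAM, 0);
    if (fd < 0)
        return -1;
    setsockopt(fd, SOL_SOCKET, SO_RCVTIMEO, &tv, sizeof tv);

    if (sendto(fd, &req, 1, 0, (struct sockaddr *)&addr, sizeof addr) == 1) {
        n = recv(fd, buf, sizeof buf, 0);
        if (n >= 0)
            rc = eip_reply_decode(buf, (size_t)n, eip);
    }
    close(fd);
    return rc;
}
```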

I noticed that there are many potentially infinite loops in the kernel,
for example walking through circular lists - no problem if everything
is OK, but guess what happens if a pointer gets corrupted due to some
other bug. Many structures are not protected by magic values - I guess
this is for speed reasons, but the checks could be made a compile-time
option -
what do you think? I can afford to slow down the system a bit if it helps
to find one of the very few remaining bugs...
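
A minimal sketch of what I mean, in plain C rather than real kernel code.
The names DEBUG_LISTS, NODE_MAGIC, WALK_LIMIT, struct node, node_init() and
walk_list() are all made up for this example. With DEBUG_LISTS set to 0 the
checks compile away and the loop is as fast (and as unprotected) as before.

```c
#include <stddef.h>

#ifndef DEBUG_LISTS
#define DEBUG_LISTS 1               /* hypothetical compile-time switch */
#endif
#define NODE_MAGIC  0x4c495354UL    /* hypothetical magic value ("LIST") */
#define WALK_LIMIT  1024            /* hypothetical bound on list length */

struct node {
#if DEBUG_LISTS
    unsigned long magic;            /* present only in debugging builds */
#endif
    struct node *next;
    int value;
};

/* Initialise a node and, in debugging builds, stamp it with the magic. */
void node_init(struct node *n, struct node *next, int value)
{
#if DEBUG_LISTS
    n->magic = NODE_MAGIC;
#endif
    n->next = next;
    n->value = value;
}

/* Walk a circular list starting at head.
 * Returns the number of nodes visited. In debugging builds it returns -1
 * instead of looping forever when a node has a NULL link or a bad magic
 * value, or when the walk has not come back to head after WALK_LIMIT
 * steps. */
int walk_list(const struct node *head)
{
    const struct node *p = head;
    int count = 0;

    if (head == NULL)
        return 0;
    do {
#if DEBUG_LISTS
        if (p == NULL || p->magic != NODE_MAGIC || count >= WALK_LIMIT)
            return -1;
#endif
        count++;
        p = p->next;
    } while (p != head);
    return count;
}
```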

Marek