Crashes under load with 2.0.27

Frank Pavageau (pavageau@imaginet.fr)
5 Jan 1997 18:27:46 GMT


I've been having crashes for the past 3 weeks; 3 of them led to
reinstalling the system. It might be a hardware problem, but
a few strange things have happened. So here's the story:

The first crash happened while I was connected by PPP. I was under X,
writing a mail with pine. As I sent the mail and pine forked into
sendmail, X scrolled up and the computer froze completely. Reset,
reboot, e2fsck which finds almost nothing (a few inodes with zero
dtime), and back under X with PPP up. Then I launch suck to get my
news, and after innxmit has locally posted the articles, I've lost
/usr/bin and /usr/lib. Unfortunately, it was kind of late and I
forgot to save the kernel log.

I reinstall the system, and 2 days later, I get a new crash with the
HD making a spooky noise, but this time X is not locked. I switch to
tty0 and get a lot of error messages from the IDE driver saying that
it gets timeouts and is trying to reset. This time I had Crack running
for a couple of hours before using my computer, and it looks like the
crash had been waiting for me, since there had been a lot of
VFS: Wrong blocksize on device 16:01
in the kernel log while it was running alone. But since Crack doesn't
make a lot of disk access, I find it quite strange. Anyway the crash
happened just after I started PPP with oopses from sh and chat, and
Unable to handle kernel paging request at virtual address c813fe7f.

The next day, another crash:
Aiee: scheduling in interrupt 0011ad3d (repeated 59 times)
general protection: 0000 (from sendmail)
and, more astonishingly:
attempt to access beyond end of device
16:01: rw=0, want=1735289205, limit=487336
attempt to access beyond end of device
16:01: rw=0, want=135652385, limit=487336
attempt to access beyond end of device
16:01: rw=0, want=1952541795, limit=487336
...
EXT2-fs warning (device 16:01): ext2_free_inode: bit already cleared for inode 49217 (twice)
and then Oops from suck.

The next day, yet another crash, but I can't find the log anymore. Anyway,
the message was from the scheduler this time:
wait_queue is bad (eip = xxxx)
q = xxxxx
*q = yyyyy

Then I tried putting the system on another HD. It ran without any
problem for 10 days, and then, yesterday, I got a crash after trying
Crack5 for a couple of hours. This time, I had no / left (but I still
had /sbin, /usr, /whatever, just no /). I managed to shut down, but
then it wouldn't boot (kernel panic trying to mount its root
read-only), and of course I couldn't find any boot disk with e2fsck on it
(it would be cool if Debian had it). So I reinstalled once again. I
haven't crashed again yet, but it could be on its way, because I got this:
EXT2-fs error (device 03:41): ext2_read_inode: bad inode number: 1920169263
after compiling the kernel twice.

Since I was running Crack almost every time it happened (either
4.1 or 5), it looks like it happens under CPU load. What bugs me is
the memory corruption that seems to result (abnormally large inode or
block numbers). I ran memtest-86 for 6 hours last night and it
found no errors at all, so the RAM seems fine. It happened with 2
different HDs (both Quantum, but the first one was an 18-month-old
540 MB drive and the second one is a brand new 2.5 GB one). This leaves
the EIDE controller, which is integrated on my Gigabyte GA-586 MB. The
only things I can find about it in the MB documentation are "Onboard CMD
IDE port" (it's not a CMD640: I tried enabling the kernel support and
it never said anything), and the basic electronic diagram, which links
the IDE ports to 2 chips: S82371FB (Triton, I guess) and 74F245.
But then I don't understand why suddenly faulty hardware would
corrupt inode tables (returning strange block numbers, I could
understand that).
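
One way to look at the "abnormally large" numbers is to check what their
raw bytes are. The sketch below is only an illustration, not a diagnosis:
it assumes the values were read as 32-bit little-endian integers (as on
x86) and prints the four bytes of each one as text. The helper names
(as_le_bytes, printable) are made up for this example; the values are
copied from the kernel messages quoted earlier.

```python
import struct

# Values copied from the kernel messages quoted in this post.
suspect_values = {
    "want (block)": [1735289205, 135652385, 1952541795],
    "bad inode number": [1920169263],
}

def as_le_bytes(value):
    """Return the 4 bytes of a 32-bit value in little-endian (x86) order."""
    return struct.pack("<I", value)

def printable(raw):
    """Render bytes as text, replacing non-printable bytes with '.'."""
    return "".join(chr(b) if 32 <= b < 127 else "." for b in raw)

for label, values in suspect_values.items():
    for value in values:
        raw = as_le_bytes(value)
        print(f"{label:18} {value:>10} 0x{value:08x} {printable(raw)!r}")
```

Three of the four values decode to printable ASCII ("uing", "clat" and
"/usr"), which would be consistent with text or path data ending up where
block and inode numbers were expected. It does not say where the bytes
came from or which component put them there.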

I never had any problem before, and I've had this MB for 18 months.
I can just add a few things about my configuration :
MB: Gigabyte GA-586 with P100, 256 kB cache, integrated EIDE controller
RAM: 32 MB
HDs (first config):
  hda: WDC AC2200F, 202MB w/64kB Cache, CHS=989/12/35 (DOS)
  hdb: ST3491A-XR, 408MB w/120kB Cache, CHS=899/15/62
       (hdb1 DOS,
        hdb2 /home)
  hdc: QUANTUM MAVERICK 540A, 516MB w/98kB Cache, LBA, CHS=1049/16/63
       (hdc1 /
        hdc2 swap, 40 MB)
  hdd: FX001DE, ATAPI CDROM drive
HDs (new config):
  hda: same as before
  hdb: QUANTUM BIGFOOT2550A, 2457MB w/87kB Cache, LBA, CHS=624/128/63, DMA
       (hdb1 /
        hdb2 swap, 64 MB
        hdb3 /usr
        hdb4 /home)
  hdc: old hdb, but hdc2 (old /home) unused right now
  hdd: same as before
Kernel 2.0.27, hdparm 3.1.
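
For reference, the device numbers in the kernel messages (16:01 and 03:41)
can be matched against the configurations above. The sketch below assumes
the kernel prints major:minor in hexadecimal and uses the usual Linux IDE
numbering (major 3 for hda/hdb, major 22 for hdc/hdd, 64 minors per
drive). Neither assumption is stated in the messages themselves, and
decode_device is a name made up for this example.

```python
# Assumed numbering (standard Linux IDE layout, not taken from the post):
# major 3 = first IDE interface (hda, hdb), major 22 = second (hdc, hdd),
# 64 minor numbers per drive, partition 0 of each drive = the whole disk.
IDE_MAJORS = {3: ("hda", "hdb"), 22: ("hdc", "hdd")}
MINORS_PER_DRIVE = 64

def decode_device(text):
    """Translate a 'MM:mm' hex pair from a kernel message into a device name."""
    major_hex, minor_hex = text.split(":")
    major, minor = int(major_hex, 16), int(minor_hex, 16)
    drives = IDE_MAJORS.get(major)
    index = minor // MINORS_PER_DRIVE
    if drives is None or index >= len(drives):
        return f"unknown (major {major}, minor {minor})"
    partition = minor % MINORS_PER_DRIVE
    return drives[index] if partition == 0 else f"{drives[index]}{partition}"

for dev in ("16:01", "03:41"):
    print(dev, "->", decode_device(dev))
```

Under those assumptions, 16:01 is hdc1 and 03:41 is hdb1, i.e. the root
partition of the first and of the new configuration respectively.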

After yesterday's crash, I used hdc2 as /var, so should the computer
crash again, I will hopefully be able to get the kernel log... AND, I
won't lose my news and mail spools once again.

So, does anyone have any idea what could be happening? Should I
get a new EIDE controller, or is it related to another hardware
problem, or could it be the kernel? Help! :)

Frank

-- 
X is a single letter denoting the unknown. This is X too.
Motif is what everyone uses to annoy people that dont have it.
Openwindoze is related to X only harder to spell and slower to use.
			[Peter Evans (peter@gol1.gol.com) in comp.unix.admin]