Re: 2.2.10 oops (finally, something I can report!)

Aaron Lehmann (aaronl@vitelus.com)
Wed, 30 Jun 1999 21:43:09 +0000 ( )


Linus,

Thanks for the reply!

Just a clarification - In my original message to the list I wasn't trying
to complain specifically about my own stability problems, but I had seen a
lot of Oopsen on the list and I wanted to comment on the general situation
of 2.2.x stability, including my own experiences. Sorry if it sounded like
I was complaining about how Linux crashed for me; if it hadn't been for all
the oopsen I saw on the list I would have suspected a hardware problem.

On Wed, 30 Jun 1999, Linus Torvalds wrote:

> In article <Pine.LNX.4.05.9906300304020.7161-100000@vitelus.com>,
> Aaron Lehmann <aaronl@vitelus.com> wrote:
> >This time I had the fortune of an Oops that didn't lock up the machine. I'm
> >going to apply KMSGDUMP so I can send all future oopses also.
> >
> >I hope this helps fix the stability problems:
> >
> >Reading Oops report from the terminal
>
> Interesting.
>
> The oops looks fine. The symbolic information also looks fine: the code
> in question does in fact look like it is the second instruction in
> "inet_sendmsg()". Everything basically seems to say that the oops is
> correctly decoded and caught.
>
> The thing that does NOT make sense is the cause of the oops itself,
> though.

Another kernel hacker pointed this out, but I did not know what it meant.

> The oops happens on
>
> c017b651 pushl %ebx
>
> and %esp = c3941e80.
>
> And quite frankly, there's not a way in h*ll that that instruction could
> raise the exception in question. But it does.
>
> I would _strongly_ suspect one of two things:
> - bad CPU.
> - bad cache or RAM timings.
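
For what it's worth, the arithmetic behind "that instruction cannot fault
here" can be sketched roughly as below. This is only an illustration, not
anything taken from the kernel source: the 8 KB per-task kernel stack size
is an assumption about i386 2.2 kernels, and push_target() is a made-up
helper name.

```python
STACK_SIZE = 8192  # assumption: 8 KB per-task kernel stack on i386


def push_target(esp, stack_size=STACK_SIZE):
    """Describe where a 32-bit push would store, given the current %esp."""
    store_addr = (esp - 4) & 0xFFFFFFFF   # pushl decrements %esp by 4, then stores
    base = esp & ~(stack_size - 1)        # start of the aligned stack area
    return {
        "store_addr": store_addr,
        "stack_base": base,
        "offset_in_stack": store_addr - base,
        "bytes_in_use_after_push": base + stack_size - store_addr,
    }


# %esp value taken from the oops quoted above
info = push_target(0xC3941E80)
for name, value in info.items():
    print(f"{name}: {value:#x}")
```

Under that assumption the store lands well inside the stack area, far from
its lower end, so a plain push to that address has no obvious reason to
fault.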

I don't want to troubleshoot a hardware problem on linux-kernel, but I
strongly suspect that the CPU or RAM is not at fault. I have been running
Linux on this machine ever since September and have not changed any BIOS
settings (except enabling APM monitor blanking) since then. Heat is not a
problem since the machine is idle most of the time and oopsen usually
occur at a load level below 0.10, which is where the machine usually
sits. Running processor-intensive tasks for hours does not seem to
trigger anything, even on a hot summer day.

But SCSI might be related to these Oopsen. I have an AdvanSys SCSI
controller and two old, heavy, hot SCSI drives. One of them has been
constantly spitting out errors and corrupting data whenever a sector of it
is accessed. This has been going on for about a month. Linux didn't start
oopsing until a week ago, but perhaps the AdvanSys or SCSI driver is
barfing on these errors and screwing up somewhere... here are a few sample
errors:

Jun 27 14:11:14 vitelus kernel: SCSI disk error : host 0 channel 0 id 1 lun 0 return code = 8000002
Jun 27 14:11:14 vitelus kernel: Current error sd08:01: sense key Recovered Error
Jun 27 14:11:14 vitelus kernel: Additional sense indicates Recovered data with error correction applied
Jun 27 14:11:14 vitelus kernel: scsidisk I/O error: dev 08:01, sector 610178
Jun 27 14:11:15 vitelus kernel: SCSI disk error : host 0 channel 0 id 1 lun 0 return code = 8000002
Jun 27 14:11:15 vitelus kernel: Current error sd08:01: sense key Recovered Error
Jun 27 14:11:15 vitelus kernel: Additional sense indicates Recovered data with error correction applied

... and it goes on for quite a while.
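
In case it helps anyone compare the error rate against the oops times, here
is a rough sketch of how the "scsidisk I/O error" lines could be tallied
per device. The regular expression is only fitted to the wording of the
lines shown in this message, and tally_io_errors() is a made-up name; other
kernels may word these messages differently.

```python
import re
from collections import Counter

# Fitted to the "scsidisk I/O error" wording shown in this message only.
IO_ERROR = re.compile(r"scsidisk I/O error: dev (\S+), sector\s+(\d+)")


def tally_io_errors(text):
    """Return (error count per device, set of (device, sector) pairs)."""
    # Collapse all whitespace so entries wrapped by a mailer are rejoined.
    flat = " ".join(text.split())
    per_dev = Counter()
    sectors = set()
    for dev, sector in IO_ERROR.findall(flat):
        per_dev[dev] += 1
        sectors.add((dev, int(sector)))
    return per_dev, sectors


# One entry from the log, including a mid-entry line wrap.
sample = """\
Jun 27 14:11:14 vitelus kernel: scsidisk I/O error: dev 08:01, sector
610178
"""
per_dev, sectors = tally_io_errors(sample)
print(per_dev)
print(sectors)
```

In practice the text would come from the syslog file rather than an inline
string.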

A few hours after this oops, I got another. The machine rebooted. Weird,
since I told KMSGDUMP to wait for a disk to be inserted and a key to be
pressed and then reboot the machine. Didn't work I guess. So I took the
opportunity to disconnect the failing hard drive (I needed thermally
insulated gloves ten minutes after it had spun down!). Since then, I
haven't had any kernel problems. But it's too early to declare victory
because less time has passed since then than the
mean-time-between-oopsen.

Even if it is true that a SCSI drive was causing the Oopsen, this seems
like a kernel bug. The drive never touches memory or uses the CPU, so it
would have to be a problem in the SCSI card, its driver, or the SCSI
subsystem of the kernel.

> Basically, the instruction cannot raise that exception with those
> inputs. So either the CPU is just doing something randomly wrong due to
> internal corruption, OR the CPU gets fed the wrong data at some earlier
> point, and when the exception happens and we re-fetch that data, now it
> is magically ok again because the timings were better this time.

I'm no kernel hacker, or even a great programmer for that matter, but
would it be possible for some bug to corrupt the stack, or wherever the
information in an oops is obtained from, so that it is no longer accurate?

> Or something.
>
> Note that the "bad CPU" thing may have been brought about by the MTRR
> changes: maybe Linux sets up some Cyrix CPU state (it was a Cyrix CPU,
> right?) incorrectly.
>
> Oh, and do you get the message
>
> Cyrix processor with "coma bug" found, workaround enabled
>
> at bootup? Maybe that workaround does something else bad.

My dmesg output is at the end of the message. Yes, it has the "coma bug".

> So I would strongly suggest turning off MTRR support, and see if the
> behaviour is more reliable.

This does make sense. I think that Cyrix MTRR support first got merged
in 2.2.9, which is when I started getting Oopsen. If I do have more
problems, I will try disabling it.

> I would also suggest making sure that everything is properly cooled:
> overheating can easily result in random problems - corrupting internal
> CPU state resulting in basically random behaviour.
>
> Linus
>

Thanks again for helping me - You don't get Windows help from Bill Gates.

Linux version 2.2.10 (root@vitelus.com) (gcc version egcs-2.91.66 19990314/Linux (egcs-1.1.2 release)) #2 Tue Jun 29 20:15:32 PDT 1999
Detected 187400617 Hz processor.
Console: colour VGA+ 80x25
Calibrating delay loop... 186.78 BogoMIPS
Memory: 62952k/65536k available (1060k kernel code, 408k reserved, 1052k data, 64k init)
Checking if this processor honours the WP bit even in supervisor mode... Ok.
VFS: Diskquotas version dquot_6.4.0 initialized
CPU: Cyrix 6x86MX 2.5x Core/Bus Clock stepping 07
Checking 386/387 coupling... OK, FPU using exception 16 error reporting.
Checking 'hlt' instruction... OK.
Cyrix processor with "coma bug" found, workaround enabled
POSIX conformance testing by UNIFIX
mtrr: v1.35 (19990512) Richard Gooch (rgooch@atnf.csiro.au)
PCI: PCI BIOS revision 2.10 entry at 0xfdb91
PCI: Using configuration type 1
PCI: Probing PCI hardware
Linux NET4.0 for Linux 2.2
Based upon Swansea University Computer Society NET3.039
NET4: Unix domain sockets 1.0 for Linux NET4.0.
NET4: Linux TCP/IP 1.0 for NET4.0
IP Protocols: ICMP, UDP, TCP
Starting kswapd v 1.5
parport0: PC-style at 0x378 [SPP,PS2]
parport0: Printer, Hewlett-Packard HP LaserJet 4ML
Detected PS/2 Mouse Port.
Serial driver version 4.27 with no serial options enabled
ttyS00 at 0x03f8 (irq = 4) is a 16550A
ttyS01 at 0x02f8 (irq = 3) is a 16550A
lp0: using parport0 (polling).
apm: BIOS version 1.2 Flags 0x03 (Driver version 1.9)
RAM disk driver initialized: 16 RAM disks of 4096K size
SIS5513: IDE controller on PCI bus 00 dev 09
SIS5513: not 100% native mode: will probe irqs later
ide0: BM-DMA at 0x4000-0x4007, BIOS settings: hda:pio, hdb:pio
ide1: BM-DMA at 0x4008-0x400f, BIOS settings: hdc:pio, hdd:pio
hda: SAMSUNG VG32163A (2.16GB), ATA DISK drive
ide0 at 0x1f0-0x1f7,0x3f6 on irq 14
hda: SAMSUNG VG32163A (2.16GB), 2063MB w/496kB Cache, CHS=524/128/63
Floppy drive(s): fd0 is 1.44M
FDC 0 is an 8272A
scsi0 : AdvanSys SCSI 3.1E: ISA PnP 16 CDB: BIOS C800, IO 110/F, IRQ 11, DMA 5
scsi : 1 host.
Vendor: COMPAQ Model: C2490A Rev: 3184
Type: Direct-Access ANSI SCSI revision: 02
Detected scsi disk sda at scsi0, channel 0, id 4, lun 0
Vendor: COMPAQ Model: CD-ROM CR-503BCQ Rev: 1.1i
Type: CD-ROM ANSI SCSI revision: 02
Detected scsi CD-ROM sr0 at scsi0, channel 0, id 5, lun 0
scsi : detected 1 SCSI cdrom 1 SCSI disk total.
Uniform CDROM driver Revision: 2.55
SCSI device sda: hdwr sector= 512 bytes. Sectors= 4110000 [2006 MB] [2.0 GB]
3c59x.c:v0.99H 11/17/98 Donald Becker http://cesdis.gsfc.nasa.gov/linux/drivers/vortex.html
eth0: 3Com 3c905 Boomerang 100baseTx at 0xf700, 00:60:97:31:d9:bd, IRQ 10
8K word-wide RAM 3:5 Rx:Tx split, autoselect/MII interface.
MII transceiver found at address 24, status 782d.
Enabling bus-master transmits and whole-frame receives.
eth1: 3Com 3c905B Cyclone 100baseTx at 0xf480, 00:10:4b:79:46:76, IRQ 9
8K byte-wide RAM 5:3 Rx:Tx split, autoselect/Autonegotiate interface.
MII transceiver found at address 24, status 786d.
MII transceiver found at address 0, status 786d.
Enabling bus-master transmits and whole-frame receives.
Partition check:
sda: sda1
hda: hda1 hda2 < hda5 > hda3
VFS: Mounted root (ext2 filesystem) readonly.
Freeing unused kernel memory: 64k freed
Adding Swap: 128988k swap-space (priority -1)
