Re: Oops and painful death of box, possibly solved

Simon Kirby (sim@netnation.com)
Wed, 31 Mar 1999 20:06:47 -0800 (PST)

What kernel version, what compiler, and what binutils (ld -v) were used?
0x0000000b looks either like odd memory corruption or a broken compile.
Did the two machines that were screwing up get the same OOPSes exactly?
Was MTRR enabled in the config? Were all servers running the same kernel?
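
One cross-check that can be done on the oops quoted below is to see whether the
faulting address matches any register in the dump. This is only a rough sketch:
the helper names (fault_address, registers, matching_registers) are made up for
illustration, and the sample text is just the relevant lines of the quoted oops.

```python
import re

# Relevant lines copied from the oops quoted in this thread.
OOPS = """\
Unable to handle kernel NULL pointer dereference at virtual address 0000000b
eax: 00001960   ebx: fffffff3   ecx: 49913b2c edx: 49913fb4
esi: c020d394   edi: 00000001   ebp: 0000000b esp: c54c5f38
"""

def fault_address(text):
    """Return the faulting virtual address as an int, or None if absent."""
    m = re.search(r"virtual address ([0-9a-fA-F]{8})", text)
    return int(m.group(1), 16) if m else None

def registers(text):
    """Map 32-bit register names (eax, ebp, ...) to their integer values."""
    pairs = re.findall(r"\b(e[a-z]{2}): ([0-9a-fA-F]{8})", text)
    return {name: int(value, 16) for name, value in pairs}

def matching_registers(text):
    """List the registers whose value equals the faulting address."""
    addr = fault_address(text)
    return sorted(n for n, v in registers(text).items() if v == addr)

if __name__ == "__main__":
    print(hex(fault_address(OOPS)), matching_registers(OOPS))
```

Here the only match is ebp, which fits the faulting instruction in the trace,
movl 0x0(%ebp),%ebp: the pointer being followed was itself garbage rather than
a clean NULL.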

I doubt disabling IDE would affect the dentry cache in any way.
You can boot the kernel with ide0=noprobe ide1=noprobe to stop it from
touching IDE (or take it out of the kernel).

How's it going, btw? ;)

Simon-

---
On Wed, 31 Mar 1999, Rick Franchuk wrote:

> Recently, I had a contractor of mine install five Intel boxes (PII-400s and
> PII-450s) in a provider in San Jose. Although all the pieces in all the
> machines were identical, two started producing the following oops under what
> appeared to be moderate to heavy disk usage:
> 
> Unable to handle kernel NULL pointer dereference at virtual address 0000000b
> current->tss.cr3 = 012c7000, pr3 = 012c7000
> *pde = 00000000
> Oops: 0000
> CPU:    0
> EIP:    0010:[<c012d075>]
> EFLAGS: 00010292
> eax: 00001960   ebx: fffffff3   ecx: 49913b2c edx: 49913fb4
> esi: c020d394   edi: 00000001   ebp: 0000000b esp: c54c5f38
> ds: 0018   es: 0018   ss: 0018
> Process httpd (pid: 17592, process nr: 58, stackpage=c54c5000)
> Stack: 00000001 c2355c00 c020d394 c301301d 874e0363 0000000e c01288b4 c2355c00
>        c54c5f80 c54c5f80 c0128ae0 c2355c00 c54c5f80 c3013000 c3013000 00000001
>        bffffbd0 c3013000 c301301d 0000000e 874e0363 c0128bc5 c3013000 00000000
> Call Trace: [<c01288b4>] [<c0128ae0>] [<c0128bc5>] [<c0126caf>] [<c0107a40>]
> Code: 8b 6d 00 8b 74 24 18 39 73 48 75 eb 8b 74 24 24 39 73 0c 75
> 
> >>EIP: c012d075 <d_lookup+65/dc>
> Trace: c01288b4 <cached_lookup+10/4c>
> Trace: c0128ae0 <lookup_dentry+fc/1b8>
> Trace: c0128bc5 <__namei+29/5c>
> Trace: c0126caf <sys_newstat+13/64>
> Trace: c0107a40 <system_call+34/38>
> Code:  c012d075 <d_lookup+65/dc>               00000000 <_EIP>: <===
> Code:  c012d075 <d_lookup+65/dc>                  0:    8b 6d 00        movl 0x0(%ebp),%ebp <===
> Code:  c012d078 <d_lookup+68/dc>                  3:    8b 74 24 18     movl 0x18(%esp,1),%esi
> Code:  c012d07c <d_lookup+6c/dc>                  7:    39 73 48        cmpl %esi,0x48(%ebx)
> Code:  c012d07f <d_lookup+6f/dc>                  a:    75 eb           jne c012d06c <d_lookup+5c/dc>
> Code:  c012d081 <d_lookup+71/dc>                  c:    8b 74 24 24     movl 0x24(%esp,1),%esi
> Code:  c012d085 <d_lookup+75/dc>                 10:    39 73 0c        cmpl %esi,0xc(%ebx)
> Code:  c012d088 <d_lookup+78/dc>                 13:    75 00           jne c012d08a <d_lookup+7a/dc>
> 
> A number of oopses would happen in rapid succession, followed by segfaults of
> whatever happened to be running and 'cannot fork()' messages streaming down
> the screen locally (I never saw them though... I'm in Vancouver, so I can't
> detail exactly what was on the screen if it wasn't logged).
> 
> Curiously, the machine also exhibited the following during boot up (which was
> annoying, because the 'timeouts' involved were fairly long):
> 
> hda: no response (status = 0xa1), resetting drive
> hda: no response (status = 0xa1)
> hdb: no response (status = 0xa1), resetting drive
> hdb: no response (status = 0xa1)
> hdc: no response (status = 0xa1), resetting drive
> hdc: no response (status = 0xa1)
> hdd: no response (status = 0xa1), resetting drive
> hdd: no response (status = 0xa1)
> 
> I have a feeling that this is significant: once I was able to get our man
> in Cali to completely disable all onboard IDE controllers (we run 100% SCSI
> using Adaptec 2940UWs, but the oopses also flared up on an NCR53c875 we
> decided to test), the oopses SEEM to have gone away entirely... I'm writing
> in hopes that someone can confirm that this is indeed the source of the
> error (to let me sleep sounder at night), and if it's a specific board-related
> issue I can find out the model number so you all can avoid it. ;)
> 
> --
>   __________________________________________
>  |                                          |
>  |  Rick Franchuk  -  TranSpecT Consulting  |
>  |_______                            _______|
>          \mailto:rickf@transpect.net/
>           \_____ICQ_#_4435025______/
> 
> 
> 
> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.rutgers.edu
> Please read the FAQ at http://www.tux.org/lkml/


