Oops in 2.2.x

Bruce Harada (bruce@ask.ne.jp)
Fri, 14 May 1999 01:00:15 +0900


To:linux-kernel@vger.rutgers.edu
cc:harada@obunsha.co.jp
Subject: An oops a day...
--------

Hello.

Since upgrading to 2.2, I get an oops after between 15 minutes and 2 days of uptime.

The most frequent situation:
- Memory usage (including cache and buffers) up to around 90%
- Some swap used
- Medium to heavy disk usage
- Running X
With this situation, I can usually depend on getting an oops in less than fifteen minutes, and if disk usage suddenyl jumps (such as during locatedb updates), an oops will almost always occur immediately. HOWEVER, I will always get an oops after a day or two even if the machine is left as it is after a reboot, with no users logged in.

Unfortunately, the oops does not always make it to syslog, but I now have another machine connected by serial console, giving me a much better capture rate.

I have tried the following kernels:
2.2.1 (SMP)
2.2.1-ac1 (SMP)
2.2.1-ac1 (UP)
2.2.3-ac4 (SMP)
2.2.4-ac1 (SMP)
2.2.5-ac3 (SMP)
2.2.5-ac7 (SMP)
2.2.5-ac7 (UP)
2.2.6-ac1 (SMP)
2.2.7 (SMP)
2.2.7 (UP)

There was no discernable improvement with any of these.

The system is as follows:

Hardware:
SuperMicro P6DBE dual motherboard
PII 400MHz x 2 (not overclocked)
256MB PC100 RAM
Tekram DC-290F (53c875) UW-SCSI
Quantum 9GB x 1
24x CD-ROM
Tulip-based NIC (21041) 10Base-T

Software:
Slackware 3.6 (libc5)
gcc 2.7.2.3
modutils 2.1.121
binutils 2.8.1.0.23
libc 5.4.46
glibc 2.0.7
procps 1.2.7

The output of ksymoops for the last few oops (2.2.7) is:

-------------------------------------------------------------------

Options used: -V (default)
-o /lib/modules/2.2.7/ (default)
-k /proc/ksyms (default)
-l /proc/modules (default)
-m /usr/src/linux/System.map (default)
-c 1 (default)

You did not tell me where to find symbol information. I will assume
that the log matches the kernel and modules that are running right now
and I'll use the default options above for symbol resolution.
If the current kernel and/or modules do not match the log, you can get
more accurate output by telling me the kernel version and where to find
map, modules, ksyms etc. ksymoops -h explains the options.

Unable to handle kernel NULL pointer dereference at virtual address 00000068
current->tss.cr3 = 0883f000, %cr3 = 0883f000
*pde = 00000000
Oops: 0000
CPU: 0
EIP: 0010:[<c01340a2>]
EFLAGS: 00010217
eax: 00000000 ebx: 0003c474 ecx: c022dbbc edx: 00000000
esi: c0086800 edi: c0086800 ebp: c73bd460 esp: c8a8bf08
ds: 0018 es: 0018 ss: 0018
Process find (pid: 11763, process nr: 29, stackpage=c8a8b000)
Stack: c013432f c0086800 0003c474 c022dbbc 0003c474 c00544c0 c7356000 00000000
c013fb70 c0086800 0003c474 c00544c0 c7356000 c8a8bf80 cd66af48 c012e5c3
c7356000 c00544c0 c8a8bf80 cb248007 00000000 c8a8bf80 c012e78a c73bd460
Call Trace: [<c013432f>] [<c013fb70>] [<c012e5c3>] [<c012e78a>] [<c012e855>] [<c012c89a>] [<c0108c14>]
Code: 39 70 68 75 0d 39 58 18 75 08 ff 40 1c 5b 5e c3 89 f6 8b 12

>>EIP: c01340a2 <find_inode+16/34>
Trace: c013432f <iget+3f/84>
Trace: c013fb70 <ext2_lookup+5c/90>
Trace: c012e5c3 <real_lookup+4b/80>
Trace: c012e78a <lookup_dentry+10a/1ac>
Trace: c012e855 <__namei+29/5c>
Trace: c012c89a <sys_newlstat+2e/94>
Trace: c0108c14 <system_call+34/40>
Code: c01340a2 <find_inode+16/34> 00000000 <_EIP>: <===
Code: c01340a2 <find_inode+16/34> 0: 39 70 68 cmpl %esi,0x68(%eax) <===
Code: c01340a5 <find_inode+19/34> 3: 75 0d jne c01340b4 <find_inode+28/34>
Code: c01340a7 <find_inode+1b/34> 5: 39 58 18 cmpl %ebx,0x18(%eax)
Code: c01340aa <find_inode+1e/34> 8: 75 08 jne c01340b4 <find_inode+28/34>
Code: c01340ac <find_inode+20/34> a: ff 40 1c incl 0x1c(%eax)
Code: c01340af <find_inode+23/34> d: 5b popl %ebx
Code: c01340b0 <find_inode+24/34> e: 5e popl %esi
Code: c01340b1 <find_inode+25/34> f: c3 ret
Code: c01340b2 <find_inode+26/34> 10: 89 f6 movl %esi,%esi
Code: c01340b4 <find_inode+28/34> 12: 8b 12 movl (%edx),%edx

Unable to handle kernel NULL pointer dereference at virtual address 00000000
current->tss.cr3 = 01c99000, %cr3 = 01c99000
*pde = 00000000
Oops: 0000
CPU: 1
EIP: 0010:[<c0133381>]
EFLAGS: 00010282
eax: 00000390 ebx: ffffffe8 ecx: c0358c72 edx: c0358072
esi: 00003652 edi: 00000001 ebp: 00000000 esp: c1fb5f38
ds: 0018 es: 0018 ss: 0018
Process top (pid: 548, process nr: 32, stackpage=c1fb5000)
Stack: 00000001 c0054ac0 c022bbf4 c6fa2006 00003652 00000003 c012e53c c0054ac0
c1fb5f80 c1fb5f80 c012e77c c0054ac0 c1fb5f80 c6fa2000 c6fa2000 00000001
bffff8b4 c6fa2000 c6fa2006 00000003 00003652 c012e855 c6fa2000 00000000
Call Trace: [<c012e53c>] [<c012e77c>] [<c012e855>] [<c012c772>] [<c0108c14>] [<c010002b>]
Code: 8b 6d 00 8b 74 24 18 39 73 48 75 eb 8b 74 24 24 39 73 0c 75

>>EIP: c0133381 <d_lookup+65/dc>
Trace: c012e53c <cached_lookup+10/4c>
Trace: c012e77c <lookup_dentry+fc/1ac>
Trace: c012e855 <__namei+29/5c>
Trace: c012c772 <sys_newstat+2e/94>
Trace: c0108c14 <system_call+34/40>
Trace: c010002b <startup_32+2b/a4>
Code: c0133381 <d_lookup+65/dc> 00000000 <_EIP>: <===
Code: c0133381 <d_lookup+65/dc> 0: 8b 6d 00 movl 0x0(%ebp),%ebp <===
Code: c0133384 <d_lookup+68/dc> 3: 8b 74 24 18 movl 0x18(%esp,1),%esi
Code: c0133388 <d_lookup+6c/dc> 7: 39 73 48 cmpl %esi,0x48(%ebx)
Code: c013338b <d_lookup+6f/dc> a: 75 eb jne c0133378 <d_lookup+5c/dc>
Code: c013338d <d_lookup+71/dc> c: 8b 74 24 24 movl 0x24(%esp,1),%esi
Code: c0133391 <d_lookup+75/dc> 10: 39 73 0c cmpl %esi,0xc(%ebx)
Code: c0133394 <d_lookup+78/dc> 13: 75 00 jne c0133396 <d_lookup+7a/dc>

Unable to handle kernel paging request at virtual address 11b2e2f6
current->tss.cr3 = 07690000, %cr3 = 07690000
*pde = 00000000
Oops: 0000
CPU: 0
EIP: 0010:[<c0133381>]
EFLAGS: 00010282
eax: 00000948 ebx: 11b2e2de ecx: 2b959129 edx: 2b959390
esi: 675793cd edi: 00000000 ebp: 11b2e2f6 esp: c7737f38
ds: 0018 es: 0018 ss: 0018
Process find (pid: 2622, process nr: 38, stackpage=c7737000)
Stack: 00000000 c447e080 c022c1ac c3e21000 675793cd 0000000a c012e53c c447e080
c7737f80 c7737f80 c012e77c c447e080 c7737f80 c3e21000 c3e21000 00000000
bffff600 c3e21000 c3e21000 0000000a 675793cd c012e855 c3e21000 00000000
Call Trace: [<c012e53c>] [<c012e77c>] [<c012e855>] [<c012c89a>] [<c0108c14>] [<c010002b>]
Code: 8b 6d 00 8b 74 24 18 39 73 48 75 eb 8b 74 24 24

>>EIP: c0133381 <d_lookup+65/dc>
Trace: c012e53c <cached_lookup+10/4c>
Trace: c012e77c <lookup_dentry+fc/1ac>
Trace: c012e855 <__namei+29/5c>
Trace: c012c89a <sys_newlstat+2e/94>
Trace: c0108c14 <system_call+34/40>
Trace: c010002b <startup_32+2b/a4>
Code: c0133381 <d_lookup+65/dc> 00000000 <_EIP>: <===
Code: c0133381 <d_lookup+65/dc> 0: 8b 6d 00 movl 0x0(%ebp),%ebp <===
Code: c0133384 <d_lookup+68/dc> 3: 8b 74 24 18 movl 0x18(%esp,1),%esi
Code: c0133388 <d_lookup+6c/dc> 7: 39 73 48 cmpl %esi,0x48(%ebx)
Code: c013338b <d_lookup+6f/dc> a: 75 eb jne c0133378 <d_lookup+5c/dc>
Code: c013338d <d_lookup+71/dc> c: 8b 74 24 24 movl 0x24(%esp,1),%esi

721 warnings issued. Results may not be reliable.

--------------------------------------------------

These are three separate incidents, with at least one reboot between each of them. The first one is choking at the following point in /usr/src/linux/fs/inode.c:

0000086c <find_inode>:
86c: 56 pushl %esi
86d: 53 pushl %ebx
86e: 8b 74 24 0c movl 0xc(%esp,1),%esi
872: 8b 5c 24 10 movl 0x10(%esp,1),%ebx
876: 8b 4c 24 14 movl 0x14(%esp,1),%ecx
87a: 8b 11 movl (%ecx),%edx
87c: eb 18 jmp 896 <find_inode+0x2a>
87e: 89 f6 movl %esi,%esi
880: 89 d0 movl %edx,%eax
882: 39 70 68 cmpl %esi,0x68(%eax) <--- Here
885: 75 0d jne 894 <find_inode+0x28>
887: 39 58 18 cmpl %ebx,0x18(%eax)
88a: 75 08 jne 894 <find_inode+0x28>
88c: ff 40 1c incl 0x1c(%eax)
88f: 5b popl %ebx
890: 5e popl %esi
891: c3 ret
892: 89 f6 movl %esi,%esi
894: 8b 12 movl (%edx),%edx
896: 31 c0 xorl %eax,%eax
898: 39 ca cmpl %ecx,%edx
89a: 75 e4 jne 880 <find_inode+0x14>
89c: 5b popl %ebx
89d: 5e popl %esi
89e: c3 ret
89f: 90 nop

/*
* Called with the inode lock held.
*/
static struct inode * find_inode(struct super_block * sb, unsigned long ino, struct list_head *head)
{
struct list_head *tmp;
struct inode * inode;

tmp = head;
for (;;) {
tmp = tmp->next;
inode = NULL;
if (tmp == head) <--- Here
break;
inode = list_entry(tmp, struct inode, i_hash);
if (inode->i_sb != sb)
continue;
if (inode->i_ino != ino)
continue;
inode->i_count++;
break;
}
return inode;
}

The second and third oops both occur in the same place in /usr/src/linux/fs/dcache.c:

00000798 <d_lookup>:
798: 83 ec 10 subl $0x10,%esp
79b: 55 pushl %ebp
79c: 57 pushl %edi
79d: 56 pushl %esi
79e: 53 pushl %ebx
79f: 8b 74 24 28 movl 0x28(%esp,1),%esi
7a3: 8b 76 04 movl 0x4(%esi),%esi
7a6: 89 74 24 1c movl %esi,0x1c(%esp,1)
7aa: 8b 74 24 28 movl 0x28(%esp,1),%esi
7ae: 8b 76 08 movl 0x8(%esi),%esi
7b1: 89 74 24 18 movl %esi,0x18(%esp,1)
7b5: 8b 74 24 28 movl 0x28(%esp,1),%esi
7b9: 8b 36 movl (%esi),%esi
7bb: 89 74 24 14 movl %esi,0x14(%esp,1)
7bf: 8b 4c 24 18 movl 0x18(%esp,1),%ecx
7c3: 03 4c 24 24 addl 0x24(%esp,1),%ecx
7c7: 89 c8 movl %ecx,%eax
7c9: c1 e8 0a shrl $0xa,%eax
7cc: 89 ca movl %ecx,%edx
7ce: 31 c2 xorl %eax,%edx
7d0: 89 c8 movl %ecx,%eax
7d2: c1 e8 14 shrl $0x14,%eax
7d5: 89 d1 movl %edx,%ecx
7d7: 31 c1 xorl %eax,%ecx
7d9: 89 c8 movl %ecx,%eax
7db: 25 ff 03 00 00 andl $0x3ff,%eax
7e0: c1 e0 03 shll $0x3,%eax
7e3: 8d b0 00 00 00 leal 0x0(%eax),%esi
7e8: 00
7e9: 89 74 24 10 movl %esi,0x10(%esp,1)
7ed: 8b a8 00 00 00 movl 0x0(%eax),%ebp
7f2: 00
7f3: 90 nop
7f4: 8d 5d e8 leal 0xffffffe8(%ebp),%ebx
7f7: 39 6c 24 10 cmpl %ebp,0x10(%esp,1)
7fb: 74 6b je 868 <d_lookup+0xd0>
7fd: 8b 6d 00 movl 0x0(%ebp),%ebp <--- Here
800: 8b 74 24 18 movl 0x18(%esp,1),%esi
804: 39 73 48 cmpl %esi,0x48(%ebx)
807: 75 eb jne 7f4 <d_lookup+0x5c>
809: 8b 74 24 24 movl 0x24(%esp,1),%esi
80d: 39 73 0c cmpl %esi,0xc(%ebx)
810: 75 e2 jne 7f4 <d_lookup+0x5c>
. .
. .
. .

struct dentry * d_lookup(struct dentry * parent, struct qstr * name)
{
unsigned int len = name->len;
unsigned int hash = name->hash;
const unsigned char *str = name->name;
struct list_head *head = d_hash(parent,hash);
struct list_head *tmp = head->next;

for (;;) {
struct dentry * dentry = list_entry(tmp, struct dentry, d_hash);
Here ---> if (tmp == head)
break;
tmp = tmp->next;
if (dentry->d_name.hash != hash)
continue;
if (dentry->d_parent != parent)
continue;
if (parent->d_op && parent->d_op->d_compare) {
if (parent->d_op->d_compare(parent, &dentry->d_name, name))
continue;
} else {
if (dentry->d_name.len != len)
continue;
if (memcmp(dentry->d_name.name, str, len))
continue;
}
return dget(dentry);
}
return NULL;
}

What I'd like to know is if there are any currently known kernel bugs relating to this, and if there are not, what causes OTHER than a kernel bug could be possible.
I'm willing to provide to provide any other information needed (dmesg output, /proc contents, kernel config, etc.)

Thanks for your help,

Bruce Harada
bruce@ask.ne.jp

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.rutgers.edu
Please read the FAQ at http://www.tux.org/lkml/