Re: Sporadious hang on 2.0.3[0,1,2,3,4pre2]

Manfred Petz (pm@radawana.cg.tuwien.ac.at)
Fri, 6 Mar 1998 13:35:15 +0100 (CET)


> > .... couldn't get a free skbuff ...
> > .... couldn't get a free page ...
> >
> > There was no output from the debug-skbuff in the logs.
>
> That one is a machine running totally out of memory. Would it be reasonable
> to expect it to run out of memory ?
>

No, the machine is almost idle all the time. Somebody mentioned CONFIG_BRIDGE:
yes, I had enabled it (I don't know why, it must have been a mistake), and it
seems to expose a memory leak on my machine, but I haven't verified that
further.

I've compiled the kernel again without CONFIG_BRIDGE and tortured it
heavily for a couple of hours. From another Linux box I opened connections
to the www, smtp and ftp ports of my server simultaneously. About 150
processes on the remote machine fired connect requests at my Linux box.
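
A load generator of the kind described above can be sketched roughly as follows. This is only an illustration, not the program that was actually used: the function names connect_once() and hammer() are made up here, and the sketch only opens and closes connections without sending any request data.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: open one TCP connection to ip:port and close it
 * again. Returns 0 if connect() succeeded, -1 otherwise. */
static int connect_once(const char *ip, unsigned short port)
{
    struct sockaddr_in addr;
    int fd, rc;

    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_port = htons(port);
    if (inet_pton(AF_INET, ip, &addr.sin_addr) != 1)
        return -1;                      /* not a dotted-quad address */

    fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0)
        return -1;
    rc = connect(fd, (struct sockaddr *)&addr, sizeof(addr));
    close(fd);
    return rc == 0 ? 0 : -1;
}

/* Hypothetical driver: fork up to nworkers children; each child makes
 * `rounds` connection attempts, cycling through the given ports.
 * Returns the number of children that were started and reaped. */
static int hammer(const char *ip, const unsigned short *ports, int nports,
                  int nworkers, int rounds)
{
    int i, r, reaped = 0;

    assert(nports > 0);                 /* avoid a modulo by zero below */

    for (i = 0; i < nworkers; i++) {
        pid_t pid = fork();
        if (pid < 0)
            break;                      /* stop forking, reap what we have */
        if (pid == 0) {
            for (r = 0; r < rounds; r++)
                (void)connect_once(ip, ports[(i + r) % nports]);
            _exit(0);
        }
    }
    while (wait(NULL) > 0)              /* collect every child */
        reaped++;
    return reaped;
}
```

With ports set to { 80, 25, 21 } (www, smtp, ftp), a call such as hammer("192.0.2.1", ports, 3, 150, 1000) would approximate the setup above; the address is a documentation placeholder and the round count is arbitrary.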

After about 3 hours the machine crashed. skbuff is at least involved here.
Here's the trace:

Unable to handle kernel NULL pointer dereference at virtual address c0000008
current->tss.cr3 = 0046f000, %cr3 = 0046f000
*pde = 00102067
*pte = 00000000
Oops: 0000
CPU: 0
EIP: 0010:[tcp_recvmsg+609/1088]
EFLAGS: 00010246
eax: 00000000 ebx: 00c57cd4 ecx: 00000022 edx: 001a0e1c
esi: 00000246 edi: 00000000 ebp: 00c57c0c esp: 0082fee8
ds: 0018 es: 0018 fs: 002b gs: 002b ss: 0018
Process httpd (pid: 12572, process nr: 24, stackpage=0082f000)
Stack: 00c57c0c 0082ff7c 00000000 00000000 00000000 00c57c30 00000000 00000000
011f7414 0067c948 6cb2503e 00159216 00c57c0c 0082ff78 00001000 00000000
00000000 0082ff7c 00001000 0067c900 080889b4 0067c990 0013b11b 0067c990
Call Trace: [inet_recvmsg+118/144] [sock_read+171/208] [sys_read+204/256] [system_call+85/128]
Code: 81 60 08 ff ff fd ff ff 45 44 8b 4c 24 14 e9 2c 01 00 00 8d
Unable to handle kernel NULL pointer dereference at virtual address c0000008
current->tss.cr3 = 0146d000, %cr3 = 0146d000
*pde = 00102067

(gdb) disassemble tcp_recvmsg
Dump of assembler code for function tcp_recvmsg:
[SNIP]
0x14d61e <tcp_recvmsg+574>: addl $0x4,%esp
0x14d621 <tcp_recvmsg+577>: movl 0x14(%esp,1),%ecx
0x14d625 <tcp_recvmsg+581>: movl 0x214(%ebp),%eax
0x14d62b <tcp_recvmsg+587>: orl $0x20000,0x8(%eax)
0x14d632 <tcp_recvmsg+594>: movl %ecx,0x14(%esp,1)
0x14d636 <tcp_recvmsg+598>: call 0x112000 <schedule>
0x14d63b <tcp_recvmsg+603>: movl 0x214(%ebp),%eax
0x14d641 <tcp_recvmsg+609>: andl $0xfffdffff,0x8(%eax)
0x14d648 <tcp_recvmsg+616>: incl 0x44(%ebp)
0x14d64b <tcp_recvmsg+619>: movl 0x14(%esp,1),%ecx
0x14d64f <tcp_recvmsg+623>: jmp 0x14d780 <tcp_recvmsg+928>
0x14d654 <tcp_recvmsg+628>: leal 0x0(%esi),%esi
0x14d65a <tcp_recvmsg+634>: leal 0x0(%edi),%edi
0x14d660 <tcp_recvmsg+640>: incw 0x6e(%ebx)
0x14d664 <tcp_recvmsg+644>: movl 0x34(%ebx),%esi
0x14d667 <tcp_recvmsg+647>: subl 0x10(%esp,1),%esi
0x14d66b <tcp_recvmsg+651>: cmpl %esi,0x38(%esp,1)
0x14d66f <tcp_recvmsg+655>: jae 0x14d675 <tcp_recvmsg+661>

Which is tcp.c around line 1712 (eax is 0 in the register dump, so
sk->socket appears to be NULL at the marked line):

cleanup_rbuf(sk);
release_sock(sk);
sk->socket->flags |= SO_WAITDATA;
schedule();
sk->socket->flags &= ~SO_WAITDATA;
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
lock_sock(sk);
continue;
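
The mapping from the oops to the marked source line can be double-checked with a little arithmetic. The sketch below is a user-space illustration only. It assumes that the flag bit is 0x20000 (taken from the orl in the disassembly), that the flags field sits at offset 8 (taken from the 0x8(%eax) operand), and that the kernel segment base is 0xc0000000, which is why a NULL pointer shows up as c00000xx. struct model_socket and the three helpers are hypothetical names, not kernel code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Flag bit as implied by "orl $0x20000,0x8(%eax)" in the disassembly. */
#define WAITDATA_BIT 0x20000u

/* Hypothetical stand-in for the socket structure: only the position of
 * `flags` (offset 8) matters, the first two words are placeholders. */
struct model_socket {
    uint32_t word0;
    uint32_t word1;
    uint32_t flags;
};

/* Mask used to clear a flag bit, i.e. the immediate of the andl. */
static uint32_t clear_mask(uint32_t bit)
{
    return ~bit;
}

/* Address of function+wanted_off, given that function+known_off is
 * located at known_addr. */
static uint32_t insn_address(uint32_t known_addr, uint32_t known_off,
                             uint32_t wanted_off)
{
    assert(known_addr >= known_off);
    return known_addr - known_off + wanted_off;
}

/* Address reported for an access to off(ptr) when the segment base
 * is seg_base. */
static uint32_t fault_address(uint32_t seg_base, uint32_t ptr, uint32_t off)
{
    return seg_base + ptr + off;
}
```

Under those assumptions, clear_mask(WAITDATA_BIT) is 0xfffdffff, the immediate of the andl; insn_address(0x14d61e, 574, 609) is 0x14d641, the andl itself; and fault_address(0xc0000000, 0, 8) is 0xc0000008, the address in the oops. All three are consistent with a NULL sk->socket at the marked line.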

How can that happen? Could a hardware interrupt handler be responsible?
Maybe this explains all those strange problems? I have a 3c509, if that
matters.

Does this help? Do you need more information? Is there any additional
debugging code I can apply? :-)
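
One kind of debugging check that would fit here is a guard around the flag update after schedule(), reporting a NULL socket pointer instead of dereferencing it. The sketch below is a user-space model of that idea, not a kernel patch: struct model_sock, struct model_socket, MODEL_SO_WAITDATA and clear_waitdata_checked() are hypothetical names, and a real kernel version would use printk() rather than fprintf().

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical flag value, taken from the immediate in the disassembly. */
#define MODEL_SO_WAITDATA 0x20000u

/* Hypothetical stand-ins for the kernel structures involved. */
struct model_socket {
    uint32_t flags;
};

struct model_sock {
    struct model_socket *socket;
};

/* Clear the wait-data flag, but check the socket pointer first.
 * Returns 0 if the flag was cleared, -1 if sk->socket was NULL
 * (the case that produces the oops when left unchecked). */
static int clear_waitdata_checked(struct model_sock *sk)
{
    assert(sk != NULL);

    if (sk->socket == NULL) {
        fprintf(stderr, "model: sk->socket is NULL after schedule()\n");
        return -1;
    }
    sk->socket->flags &= ~MODEL_SO_WAITDATA;
    return 0;
}
```

A check like this would not fix anything; it would only show whether the pointer really goes away while the process sleeps, and how often.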

I'm currently trying to reproduce that Oops; I'm convinced it will happen
again very soon. Note that this one is different from the problems that
usually show up (complete lock-up).

pm
