Oops - hard crash in 2.2.15 - tcp_keepalive (again!!)

From: Whit Blauvelt (whit@transpect.com)
Date: Tue May 16 2000 - 19:03:39 EST


[1.] One line summary of the problem:

2.2.15 Oops - tcp_keepalive (again!!)

[2.] Full description of the problem/report:

As I reported last week:

"This is a system for which I've reported TCP related crashes on kernels back
to 2.2.13. However, every single bit of hardware has been changed along the
way (one part at a time) aside from the Tekram SCSI controller and the SCSI
hard drive and CD drive, which don't seem suspects in this. Saw similar
crash in 2.2.15pre20, but didn't have time to copy screen on that. That was
with kernel 3com driver, now is with 3com's 3c90x. Crashes are much less
frequent now - it's made it almost two weeks (but twice one day recently
under 2.2.15pre20). The current crash was after almost a week. It's not that
busy a system, but is running Apache, sendmail, bind 8, proftpd, ipchains,
and masquerading (for two other boxes), and answering to 8 outside IPs. I
have two similar systems elsewhere in terms of software configuration and
function that have never crashed over some months, but those carry lighter
loads even than this one, one with 2.2.13 and one with 2.2.14. The current
crash is in a different segment of TCP code than earlier ones (used to be in
tcp_keepalive), probably peeling an onion here to get to the central bug."

This present crash is with 2.2.15 rather than 2.2.16pre2. I also applied
Andrea's delack-timer-5 patch this time around (I had a prior crash in
vanilla 2.2.15 without it, though). It went 6 days before crashing this
time, which is about average, so the patch evidently wasn't the fix. This
oops points straight at tcp_keepalive again, like the older crashes, except
that it is not a "NULL pointer exception" this time. The one last week was
slightly different, but still in the TCP code.

[3.] Keywords (i.e., modules, networking, kernel):

kernel 2.2.15

[4.] Kernel version (from /proc/version):

Linux version 2.2.15 (root@china.patternbook.com) (gcc version 2.7.2.3) #1 Thu May 11 15:38:25 EDT 2000

[5.] Output of Oops.. message (if applicable) with symbolic information
     resolved (see Documentation/oops-tracing.txt) [this is copied by hand
     again, sigh]

Unable to handle kernel paging request at virtual address 54796c86
current->tss.cr3 = 00101000, %cr3 = 00101000
*pde = 00000000
Oops: 0000
CPU: 0
EIP: 0010:[<c0179278>]
EFLAGS: 00010206
eax: c7e80000 ebx: 54796c66 ecx: c0203198 edx: 54796c66
esi: 00000001 edi: 00000006 ebp: 0000437f esp: c0215000
ds: 0018 es: 0018 ss: 0018
Stack: 00000006 01600dd8 54796c66 00000000 c0179699 c0209650 00000000 c0179668
        00000001 c0215f48 c01113a9 00000000 00000001 c0252384 00000000 c0215f60
        c0117b99 00000000 c0214000 01600d93 c010a2cd 00000e00 c0109f9c 00000000
Call Trace: [<c0179699>] [<c0179668>] [<c01113a9>] [<c0117b99>] [<c010a2cd>] [<c0109f9c>] [<c01078a9>]
                [<c0106000>] [<c01078cc>] [<c01090fc>] [<c0106000>] [<c010607b>] [<c0106000>] [<c0100175>]
Code: 8b 53 20 89 54 24 10 83 7b 30 00 0f 85 ef 00 00 00 8a 43 77

>>EIP: c0179278 <tcp_keepalive+38/180>
Trace: c0179699 <tcp_sltimer_handler+31/70>
Trace: c0179668 <tcp_sltimer_handler+0/70>
Trace: c01113a9 <timer_bh+2e9/330>
Trace: c0117b99 <do_bottom_half+49/64>
Trace: c010a2cd <do_IRQ+39/40>
Trace: c0109f9c <common_interrupt+18/20>
Trace: c01078a9 <cpu_idle+61/70>
Trace: c0106000 <get_options+0/74>
Code: c0179278 <tcp_keepalive+38/180> 00000000 <_EIP>: <===
Code: c0179278 <tcp_keepalive+38/180> 0: 8b 53 20 mov 0x20(%ebx),%edx <===
Code: c017927b <tcp_keepalive+3b/180> 3: 89 54 24 10 mov %edx,0x10(%esp,1)
Code: c017927f <tcp_keepalive+3f/180> 7: 83 7b 30 00 cmpl $0x0,0x30(%ebx)
Code: c0179283 <tcp_keepalive+43/180> b: 0f 85 ef 00 00 00 jne c0179378 <tcp_keepalive+138/180>
Code: c0179289 <tcp_keepalive+49/180> 11: 8a 43 77 mov 0x77(%ebx),%al

Aiee, killing interrupt handler
Kernel panic: Attempted to kill the idle task!
In swapper task - not syncing
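
The numbers in the oops above can be cross-checked by hand. The following is only a sanity-check sketch, not a diagnosis: it recomputes the faulting address from the decoded instruction and the register dump, and derives the function start addresses from the symbol offsets. All values are copied from the oops text; the helper names are made up for illustration, and PAGE_OFFSET = 0xC0000000 is an assumption (the default i386 3G/1G split).

```python
# Values copied by hand from the oops text; nothing here touches a kernel.
FAULT_ADDR = 0x54796C86    # "Unable to handle kernel paging request at ..."
EBX = 0x54796C66           # ebx from the register dump
DISPLACEMENT = 0x20        # from "mov 0x20(%ebx),%edx" at tcp_keepalive+38
PAGE_OFFSET = 0xC0000000   # ASSUMPTION: default i386 kernel/user split


def effective_address(base, disp):
    """Address a 32-bit base+displacement memory operand refers to."""
    return (base + disp) & 0xFFFFFFFF


def symbol_base(addr, offset):
    """Start of a function, given an address and its <symbol+offset>."""
    return addr - offset


def as_ascii(value):
    """Return the 4 little-endian bytes as text if all are printable."""
    raw = value.to_bytes(4, "little")
    if all(0x20 <= b < 0x7F for b in raw):
        return raw.decode("ascii")
    return None


if __name__ == "__main__":
    print(hex(effective_address(EBX, DISPLACEMENT)))   # compare to FAULT_ADDR
    print(hex(symbol_base(0xC0179278, 0x38)))          # tcp_keepalive start
    print(hex(symbol_base(0xC0179699, 0x31)))          # tcp_sltimer_handler start
    print(EBX < PAGE_OFFSET, as_ascii(EBX))
```

The computed address matches the reported fault address, so the bad value is the pointer in ebx itself, which lies below the assumed PAGE_OFFSET and so is not a kernel address. Its four bytes also happen to be printable ASCII, which may or may not mean the pointer was overwritten with text data; that is a guess, not something the oops proves.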

[6.] A small shell script or example program which triggers the
     problem (if possible)

In the past, setting tcp_keepalive to fire very frequently made it crash
sooner. I haven't tried that with the present setup.

[7.] Environment
[7.1.] Software (add the output of the ver_linux script here)

Linux china.patternbook.com 2.2.15 #1 Thu May 11 15:38:25 EDT 2000 i586 unknown
Kernel modules 2.3.9
Gnu C 2.7.2.3
Binutils 2.9.4.0.6
Linux C Library 2.1.1
Dynamic linker ldd (GNU libc) 2.1.1
Procps 2.0.2
Mount 2.9o
Net-tools 1.52
Console-tools 1999.03.02
Sh-utils 1.16
Modules Loaded ip_masq_raudio ip_masq_ftp 3c90x

[7.2.] Processor information (from /proc/cpuinfo):

processor : 0
vendor_id : AuthenticAMD
cpu family : 5
model : 8
model name : AMD-K6(tm) 3D processor
stepping : 12
cpu MHz : 451.034862
cache size : 64 KB
fdiv_bug : no
hlt_bug : no
sep_bug : no
f00f_bug : no
coma_bug : no
fpu : yes
fpu_exception : yes
cpuid level : 1
wp : yes
flags : fpu vme de pse tsc msr mce cx8 sep mtrr pge mmx 3dnow
bogomips : 897.84

[7.3.] Module information (from /proc/modules):

ip_masq_raudio 2892 0 (unused)
ip_masq_ftp 2504 0 (unused)
3c90x 22876 2 (autoclean)

[7.4.] SCSI information (from /proc/scsi/scsi)

Attached devices:
Host: scsi0 Channel: 00 Id: 00 Lun: 00
  Vendor: IBM Model: DORS-32160 Rev: WA6A
  Type: Direct-Access ANSI SCSI revision: 02
Host: scsi0 Channel: 00 Id: 03 Lun: 00
  Vendor: NEC Model: CD-ROM DRIVE:222 Rev: 3.0i
  Type: CD-ROM ANSI SCSI revision: 02

If anyone has advice on how to get a stable kernel on this system, I'd
love to hear it. Again, _every_ piece of hardware other than the SCSI
drives and controller has been replaced, one piece at a time, and I also
reverted from egcs to true gcc. It's most definitely a persistent kernel
bug. It might have to do with how many different functions the system is
performing, but that's not negotiable in this setup, unfortunately. It's
also a pretty standard set of services, nothing exotic. The machine is not
at all overloaded process-wise, and the kernel should be able to handle it.

Whit@transpect.com
