Hard tcp_keepalive crash: 2.2.15pre10 continues tradition

From: Whit Blauvelt (whit@transpect.com)
Date: Sat Feb 26 2000 - 11:38:10 EST


Yup, the new tcp code in pre10 didn't fix it - system crashed after about
25 hours:

[1.] One line summary of the problem:

Hard crashes of system related to tcp_keepalive

[2.] Full description of the problem/report:

The following frozen screen occurring between once a week and once a
day (with only the most minor differences, has appeared running kernels
1.2.13, 1.2.14 and now 1.2.15pre9 and pre10, through replacement of the
memory, motherboard, CPU, brand of NICs, video card, and drive cables
(what's constant now is the Tekram DC-390F SCSI controller, the IBM SCSI
HD, a Western Digital IDE HD used for /boot and swap, and the case and
power supply):

[3.] Keywords (i.e., modules, networking, kernel):

tcp_keepalive crash

[4.] Kernel version (from /proc/version):

Linux version 2.2.15pre10 (root@china.patternbook.com) (gcc version egcs-2.91.66 19990314/Linux (egcs-1.1.2 release)) #1 Fri Feb 25 08:5
3:40 EST 2000

[5.] Output of Oops.. message (if applicable) with symbolic information
resolved (see Documentation/oops-tracing.txt)

Unable to handle kernel NULL pointer dereference at virtual address 00000050
Oops: 0000
CPU: 0
EIP: 0010 [<c01783b8>]
EFLAGS: 000010282
eax: 00000000 ebx: c2fffa40 ecx: c2fffca8 edx: 0000000d
esi: 00000001 edi: 00000000 ebp: 00006188 esp: c021fef4
ds: 0018 es: 0018 ss: 0018
Stack: 00000007 00000000 c0212a30 00000001 00000020 0086e20d 00000000 00000000
        c0178742 c0212a30 00000000 c0178710 0000000d c021ff4c c0111261 00000000
        00000001 c025c640 0086e1e2 00000001 00000000 c021e000 c021ff60 c011797d
Call Trace: [<c0178742>] [<c0178710>] [<c0111261>] [<c011797d>] [<c010alc1>] [<c0109e28>] [<c0107851>]
                [<c0106000>] [<c0107874>] [<c0108ff8>] [<c0106000>] [<c010607b>] [<c0106000>] [<c0100175>]
Code: 8b 40 50 ff d0 83 c4 10 83 fe 01 75 06 ff 0d 40 e5 25 c0 66

>>EIP: c01783b8 <tcp_keepalive+e8/188>
Trace: c0178742 <tcp_sltimer_handler+32/70>
Trace: c0178710 <tcp_sltimer_handler+0/70>
Trace: c0111261 <timer_bh+331/388>
Trace: c011797d <do_bottom_half+45/64>
Trace: c0106000 <get_options+0/74>
Code: c01783b8 <tcp_keepalive+e8/188> 00000000 <_EIP>: <===
Code: c01783b8 <tcp_keepalive+e8/188> 0: 8b 40 50 mov 0x50(%eax),%eax <===
Code: c01783bb <tcp_keepalive+eb/188> 3: ff d0 call *%eax
Code: c01783bd <tcp_keepalive+ed/188> 5: 83 c4 10 add $0x10,%esp
Code: c01783c0 <tcp_keepalive+f0/188> 8: 83 fe 01 cmp $0x1,%esi
Code: c01783c3 <tcp_keepalive+f3/188> b: 75 06 jne c01783cb <tcp_keepalive+fb/188>
Code: c01783c5 <tcp_keepalive+f5/188> d: ff 0d 40 e5 25 c0 decl 0xc025e540
Code: c01783cb <tcp_keepalive+fb/188> 13: 66 data16

Aiee, killing interrupt handler
Kernel panic: Attemted to kill the idle task:
In swapper task - not syncing

[6.] A small shell script or example program which triggers the
problem (if possible)

Not a clue. Happens at different times of day, in different load
conditions - more common though when masqueraded systems are not on.

[7.] Environment
[7.1.] Software (add the output of the ver_linux script here)

Linux china.patternbook.com 2.2.15pre10 #1 Fri Feb 25 08:53:40 EST 2000
i586 unknown Kernel modules 2.3.9
Gnu C egcs-2.91.66
Binutils 2.9.4.0.6
Linux C Library 2.1.1
Dynamic linker ldd (GNU libc) 2.1.1
Procps 2.0.2
Mount 2.9o
Net-tools 1.52
Console-tools 1999.03.02
Sh-utils 1.16
Modules Loaded 3c59x

[7.2.] Processor information (from /proc/cpuinfo):

processor : 0
vendor_id : AuthenticAMD
cpu family : 5
model : 8
model name : AMD-K6(tm) 3D processor
stepping : 12
cpu MHz : 367.512290 (it's a 450 but I'm clocking it slow)
cache size : 64 KB
fdiv_bug : no
hlt_bug : no
sep_bug : no
f00f_bug : no
coma_bug : no
fpu : yes
fpu_exception : yes
cpuid level : 1
wp : yes
flags : fpu vme de pse tsc msr mce cx8 sep mtrr pge mmx 3dnow
bogomips : 732.36

Note this also was happenning with an AMD 233

[7.3.] Module information (from /proc/modules):

3c59x 19332 2 (autoclean)

Note this was also happenning with 2 LinkSys (tulip) NICs rather than the 2
3Com now in use

[7.4.] SCSI information (from /proc/scsi/scsi)

General information:
Chip sym53c875E, device id 0xf, revision id 0x26
IO port address 0xc000, IRQ number 5
Using memory mapped IO at virtual address 0xc4000000
Synchronous period factor 12, max commands per lun 32

[7.5.] Other information that might be relevant to the problem
(please look in /proc and include all information that you
think to be relevant):

The system is a DSL connected firewall (connected to SpeedStream bridge
to Covad to AT&T ATM) with masquerading for two other systems (although it
fails whether they are active or off), running DNS (bind 8.2.2 5), Apache,
Proftpd, sendmail, MySQL.

[X.] Other notes, patches, fixes, workarounds:

Didn't tcp_keepalive used to be a compile-time option in the kernel? What
would I do to compile a current kernel without this 'feature'? Would I
miss it? This is a production system that I can't man 24x7, so any
kludged fix is preferable to continuing to have to check it constantly to
stay up.

Please cc me on responses, I'm not subscribed to linux-kernel. Thanks.

Whit Blauvelt
whit@transpect.com

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.rutgers.edu
Please read the FAQ at http://www.tux.org/lkml/



This archive was generated by hypermail 2b29 : Tue Feb 29 2000 - 21:00:15 EST