[Bug 1627] New: system crashes after 3 hours test

From: Martin J. Bligh
Date: Tue Dec 02 2003 - 11:53:30 EST


http://bugme.osdl.org/show_bug.cgi?id=1627

Summary: system crashes after 3 hours test.
Kernel Version: 2.6.0-test9
Status: NEW
Severity: high
Owner: bugme-janitors@xxxxxxxxxxxxxx
Submitter: dvnguyen@xxxxxxxxxx
CC: wmb@xxxxxxxxxx


Distribution:
Hardware Environment:
pSeries p650
Software Environment:
2.6.0-test9
Problem Description:
Ran the SPECweb99_SSL benchmark test for 3 hours and the system crashed.
Here is some information from xmon:
0:mon> t
c0000007fc70fd00 c00000000035ddfc .tcp_do_twkill_work+0x19c/0x1b0
c0000007fc70fdd0 c00000000035e064 .twkill_work+0x11c/0x1b4
c0000007fc70fe80 c00000000006457c .worker_thread+0x280/0x3b8
c0000007fc70ff90 c000000000017d4c .kernel_thread+0x4c/0x68
0:mon>
0:mon> r
R00 = 0000000000000001 R16 = 0000000000000000
R01 = c0000007fc70fd00 R17 = 0000000000000000
R02 = c000000000679000 R18 = 0000000000000000
R03 = c0000007fc2a5b80 R19 = 0000000000000000
R04 = c0000007fc2a4000 R20 = 0000000000c00000
R05 = 0000000000000000 R21 = 0000000000000000
R06 = c0000000005ec880 R22 = c000000000745ce8
R07 = c0000007f9000000 R23 = 0000000000000064
R08 = 00000000000d4c50 R24 = 0000000000000000
R09 = 0000000000000000 R25 = 0000000000000001
R10 = 0000000000000001 R26 = 0000000000000001
R11 = c0000007fc2a4010 R27 = c00000065069aef8
R12 = 0000000024000080 R28 = c00000062d56acf8
R13 = c0000000005aa000 R29 = c0000000004ea428
R14 = 0000000000000000 R30 = c0000000005927e8
R15 = 0000000000000000 R31 = c00000062d56ac80
pc = c00000000035dce0 msr = 9000000000009032
lr = c00000000035ddfc cr = 0000000084008080
ctr = 0000000000000000 xer = 0000000020000000 trap = 300
0:mon> S
msr = 9000000000001032 sprg0= 0000000000000000
pvr = 0000000000380201 sprg1= 0000000000000000
dec = 000000003f96aab1 sprg2= 0000000000c00000
sp = c0000007fc70f560 sprg3= c0000000005aa000
toc = c000000000679000 dar = 0000000000000000
srr0 = c00000000000a888 srr1 = 9000000000001032
asr = 0000000000009001
sr00 = 0000000000000053 sr08 = 0000000000000053
sr01 = 0000000000000053 sr09 = 0000000000000053
sr02 = 0000000000000053 sr10 = 0000000000000053
sr03 = 0000000000000053 sr11 = 0000000000000053
sr04 = 0000000000000053 sr12 = 0000000000000053
sr05 = 0000000000000053 sr13 = 0000000000000053
sr06 = 0000000000000053 sr14 = 0000000000000053
sr07 = 0000000000000053 sr15 = 0000000000000053
Paca:
Local Processor Control Area (LpPaca):
Saved Srr0=0000000000000000 Saved Srr1=0000000000000000
Saved Gpr3=0000000000000000 Saved Gpr4=0000000000000000
Saved Gpr5=0000000000000000
Local Processor Register Save Area (LpRegSave):
Saved Sprg0=0000000000000000 Saved Sprg1=0000000000000000
Saved Sprg2=0000000000000000 Saved Sprg3=0000000000000000
Saved Msr =0000000000000000 Saved Nia =0000000000000000
0:mon> e
cpu 0: Vector: 300 (Data Access) at [c0000007fc70fa80]
pc: c00000000035dce0 (.tcp_do_twkill_work+0x80/0x1b0)
lr: c00000000035ddfc (.tcp_do_twkill_work+0x19c/0x1b0)
sp: c0000007fc70fd00
msr: 9000000000009032
dar: 0
dsisr: 42000000
current = 0xc0000007fc7547b8
paca = 0xc0000000005aa000
pid = 10, comm = events/0
0:mon> s
Oops: Kernel access of bad area, sig: 11 [#1]
NIP: C00000000035DCE0 XER: 0000000020000000 LR: C00000000035DDFC
REGS: c0000007fc70fa80 TRAP: 0300 Not tainted
MSR: 9000000000009432 EE: 1 PR: 0 FP: 0 ME: 1 IR/DR: 11
DAR: 0000000000000000, DSISR: 0000000042000000
TASK = c0000007fc7547b8[10] 'events/0' CPU: 0
GPR00: 0000000000000001 C0000007FC70FD00 C000000000679000 C0000007FC2A5B80
GPR04: C0000007FC2A4000 0000000000000000 C0000000005EC880 C0000007F9000000
GPR08: 00000000000D4C50 0000000000000000 0000000000000001 C0000007FC2A4010
GPR12: 0000000024000080 C0000000005AA000 0000000000000000 0000000000000000
GPR16: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
GPR20: 0000000000C00000 0000000000000000 C000000000745CE8 0000000000000064
GPR24: 0000000000000000 0000000000000001 0000000000000001 C00000065069AEF8
GPR28: C00000062D56ACF8 C0000000004EA428 C0000000005927E8 C00000062D56AC80
NIP [c00000000035dce0] .tcp_do_twkill_work+0x80/0x1b0
Call Trace:
[c00000000035e064] .twkill_work+0x11c/0x1b4
[c00000000006457c] .worker_thread+0x280/0x3b8
[c000000000017d4c] .kernel_thread+0x4c/0x68
<0>Kernel panic: Fatal exception in interrupt
In interrupt handler - not syncing
<0>Rebooting in 180 seconds..
=============================================

Some debug info, quoted here:
"I disassembled the kernel around where the crash occurs, and compared that to
the source code. It's a little hard to follow due to the inlining, but I think
I see where in the source the crash is occurring.

tcp_do_twkill_work calls __tw_del_dead_node(tw), which calls
__hlist_del(&tw->tw_death_node). I think the crash occurs in __hlist_del, at
the line shown below.

static __inline__ void __hlist_del(struct hlist_node *n)
{
	struct hlist_node *next = n->next;
	struct hlist_node **pprev = n->pprev;
	*pprev = next;		/* <<<<<<---------- crash occurs here */
	if (next)
		next->pprev = pprev;
}

The corresponding assembly code looks as follows:

c000000000376380: eb 7c 00 00 ld r27,0(r28)
c000000000376384: e9 3c 00 08 ld r9,8(r28)
c000000000376388: 3b bc ff 88 addi r29,r28,-120
c00000000037638c: 2e 3b 00 00 cmpdi cr4,r27,0
c000000000376390: fb 69 00 00 std r27,0(r9) <<<---- crashes here
c000000000376394: 41 92 00 08 beq- cr4,c00000000037639c
c000000000376398: f9 3b 00 08 std r9,8(r27)
"
"The xmon output shows that r9 == 0. Linking this back to the source code, this
means that pprev == n->pprev == NULL in __hlist_del."
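
To illustrate the failure mode described in the quoted analysis, here is a
minimal user-space sketch. It is not the kernel code and not a proposed fix.
The struct definitions and hlist_add_head are simplified copies made for this
example, hlist_del_unchecked mirrors the __hlist_del logic quoted above, and
hlist_del_checked and demo_double_delete are hypothetical helpers. The sketch
assumes the node was already unlinked and had its pprev set to NULL; how
tw_death_node actually reached that state in this crash is not known from the
xmon output.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified user-space copies of the list types, for illustration only. */
struct hlist_node {
	struct hlist_node *next;
	struct hlist_node **pprev;
};

struct hlist_head {
	struct hlist_node *first;
};

/* Insert n at the front of list h. */
static void hlist_add_head(struct hlist_node *n, struct hlist_head *h)
{
	struct hlist_node *first = h->first;

	n->next = first;
	if (first)
		first->pprev = &n->next;
	h->first = n;
	n->pprev = &h->first;
}

/*
 * Same logic as the __hlist_del quoted above: it stores through
 * n->pprev unconditionally, so a NULL pprev means a store to address 0.
 */
static void hlist_del_unchecked(struct hlist_node *n)
{
	struct hlist_node *next = n->next;
	struct hlist_node **pprev = n->pprev;

	*pprev = next;
	if (next)
		next->pprev = pprev;
}

/*
 * Hypothetical helper: unlink n only if it is still on a list, then mark
 * it as unlinked by clearing its pointers.
 * Returns 1 if the node was unlinked, 0 if it was already off the list.
 */
static int hlist_del_checked(struct hlist_node *n)
{
	if (n->pprev == NULL)
		return 0;	/* hlist_del_unchecked() would fault here */
	hlist_del_unchecked(n);
	n->next = NULL;
	n->pprev = NULL;
	return 1;
}

/* Hypothetical demo: delete the same node twice without faulting. */
static void demo_double_delete(void)
{
	struct hlist_head h = { NULL };
	struct hlist_node a, b;

	hlist_add_head(&a, &h);
	hlist_add_head(&b, &h);			/* list is now: b -> a */

	assert(hlist_del_checked(&b) == 1);	/* first delete unlinks b */
	assert(h.first == &a);
	assert(hlist_del_checked(&b) == 0);	/* second delete is caught */
}
```

Calling demo_double_delete() from a main() exercises the double-delete case.
Replacing the second hlist_del_checked(&b) with hlist_del_unchecked(&b) would
attempt the same store through a NULL pprev that the disassembly shows at the
std instruction.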

I'll test the latest kernel (test11) and will post the results back here.

Steps to reproduce:
Need to run SPECweb99_SSL benchmark to reproduce problem.

