Re: 1.3.95 is not stable

Linus Torvalds (torvalds@cs.helsinki.fi)
Fri, 26 Apr 1996 09:16:32 +0300 (EET DST)


On 25 Apr 1996, Steven L Baur wrote:
>
> Here are 3 crash traces from syslog. The symptoms were all similar in
> that init died each time making shutdown problematic. I've downgraded
> to .93 out of self-defense. Presumably the constant network crashes
> are going to get fixed before 2.0, right?

Is this 100% network-related? Any correlation with anything else?

The strange thing is that the 94/95 patches didn't really change much of
the kernel: they contain mostly m68k stuff and spelling fixes.

The crash you see is due to memory corruption in the kernel (the function
"handle_signal()" to be exact): the code sequence _should_ be

addl $0x10,%esp
cmpl $0x0,0x8(%ebx)
jnl <handle_signal+106>

which is "0x83 0xc4 0x10 0x83 0x7b 0x08 0x00 0x7d 0x06". Your panic
reports "0x83 0x84 0x10 0x83 0x7b 0x08 0x00 0x7d 0x06".

Note the _one_bit_ error in the second byte... (0xc4 has become a 0x84).

Now, one-bit errors might be due to flaky memory, and the reason you see
the problems with some kernels and not others _may_ be because the small
differences in the kernels move the code around a bit, and then the
errors show up in different places (or fail to show up at all).

The error may well be due to a bogus kernel pointer being used for bitmap
operations too, of course. That's the more likely explanation (bogus
pointers _usually_ result in byte or word corruption rather than just a
single bit error, but this can be unlucky).

>
> Crash #1: (Uptime of about 8 hours)
> Apr 24 18:13:37 deanna kernel: kfree of non-kmalloced memory: 010049fc, next= 0081e000, order=36
> Apr 24 18:13:37 deanna kernel: kfree of non-kmalloced memory: 01004800, next= 0081e000, order=36
> Apr 24 18:13:37 deanna kernel: kfree of non-kmalloced memory: 01004408, next= 0081e000, order=36
> Apr 24 18:13:37 deanna kernel: kfree of non-kmalloced memory: 01004010, next= 0081e000, order=4
> Apr 24 18:13:37 deanna kernel: kfree of non-kmalloced memory: 01004bf8, next= 0081e000, order=4

These happened earlier, and they are pretty bad. The machine seems to
have stayed up for more than one hour after these, though.

Is this the same (or related - similar motherboard/processor/memory)
machine that had similar non-kmalloced problems earlier with various
1.3.x kernels?

> Apr 24 19:29:08 deanna kernel: Unable to handle kernel paging request at virtual address c8103e83
> Apr 24 19:29:08 deanna kernel: current->tss.cr3 = 0096c000, ^Lr3 = 0096c000
> Apr 24 19:29:08 deanna kernel: *pde = 00000000
> Apr 24 19:29:08 deanna kernel: Oops: 0000
> Apr 24 19:29:08 deanna kernel: CPU: 0
> Apr 24 19:29:08 deanna kernel: EIP: 0010:[handle_signal+91/120]
> Apr 24 19:29:08 deanna kernel: EFLAGS: 00010202
> Apr 24 19:29:08 deanna kernel: eax: 0806c300 ebx: 00976914 ecx: 00000000 edx: 00010000
> Apr 24 19:29:08 deanna kernel: esi: 00000011 edi: 00050101 ebp: 00010000 esp: 0096ef68
> Apr 24 19:29:08 deanna kernel: ds: 0018 es: 0018 fs: 002b gs: 002b ss: 0018
> Apr 24 19:29:08 deanna kernel: Process zsh (pid: 221, process nr: 43, stackpage=0096e000)
> Apr 24 19:29:08 deanna kernel: Stack: 00976914 0096efbc 00000011 00010000 00000011 0096efbc 0010a255 00000011
> Apr 24 19:29:08 deanna kernel: 00976914 00010000 0096efbc 00010000 0096efbc 0809578c bffff7f4 00109ba7
> Apr 24 19:29:08 deanna kernel: 00010000 0096efbc 00976414 00000000 0010a379 00000000 00000000 fffefffe
> Apr 24 19:29:08 deanna kernel: Call Trace: [do_signal+529/604] [sys_sigsuspend+59/76] [system_call+89/160]
> Apr 24 19:29:08 deanna kernel: Code: 83 84 10 83 7b 08 00 7d 06 c7 03 00 00 00 00 a1 0c 47 1f 00

Ok, it crashes because the code segment is corrupted: instead of the
expected "addl $10,%esp", the one-bit error has resulted in a instruction
that says "addl $0x7d,0x87b83(%eax,%edx,1)", and it's not surprising that
the kernel is unhappy about that ;-)

This seems to have been while strying to set up stack frame for a
"SIGCHLD" signal, and not surprisingly the error never goes away (in
fact, it only gets worse - as one process gets killed because of the bad
code segment, the parent also gets a SIGCHLD, and..)

> Crash #2 (uptime of about 1/2 hour)
> Apr 24 19:50:16 deanna kernel: general protection: 0000
> Apr 24 19:50:16 deanna kernel: CPU: 0
> Apr 24 19:50:16 deanna kernel: EIP: 0010:[wake_up_interruptible+34/216]
> Apr 24 19:50:16 deanna kernel: EFLAGS: 00010202
> Apr 24 19:50:16 deanna kernel: eax: 0093ea94 ebx: 00ac4000 ecx: 0000014c edx: 7273752f
> Apr 24 19:50:16 deanna kernel: esi: 0000014c edi: 0093ea94 ebp: 00848c94 esp: 00848c88
> Apr 24 19:50:16 deanna kernel: ds: 0018 es: 0018 fs: 002b gs: 002b ss: 0018
> Apr 24 19:50:16 deanna kernel: Process rpc.nfsd (pid: 80, process nr: 17, stackpage=00848000)
> Apr 24 19:50:16 deanna kernel: Stack: 008eca04 0000014c 00000000 001d9478 0014956e 0093ea94 008eca04 00134b58
> Apr 24 19:50:16 deanna kernel: 008eca04 008eca04 010e74d0 00134e43 008eca04 010e74d0 00000300 00000000
> Apr 24 19:50:16 deanna kernel: 00000320 001352a2 010e74d0 00000000 00000320 010e74be 0018e2b5 010e74d0
> Apr 24 19:50:16 deanna kernel: Call Trace: [def_callback3+42/64] [sock_wfree+36/48] [kfree_skb+167/232] [dev_kfree_skb+62/76] [el3_start_xmit+425/528] [dev_queue_xmit+375/444] [ip_build_xmit+1490/2952]
> Apr 24 19:50:16 deanna kernel: [rw_intr+250/804] [do_sd_request+346/364] [requeue_sd_request+40/3492] [udp_send+208/244] [udp_getfrag+0/192] [udp_sendto+220/292] [skb_free_datagram+106/112] [udp_sendmsg+52/188]
> Apr 24 19:50:16 deanna kernel: [inet_sendmsg+153/176] [sys_sendto+290/316] [lookup+218/244] [sys_socketcall+495/732] [system_call+89/160]
> Apr 24 19:50:16 deanna kernel: Code: 8b 02 83 f8 01 75 61 9c 5e fa c7 02 00 00 00 00 83 7a 4c 00

Ok, this is the wake_up_interruptible() loop, it disassembles to

movl (%edx),%eax
cmpl $0x1,%eax
jne *+x

which is essentially just the C code fragment

if (p->state == TASK_INTERRUPTIBLE)

with the "p" pointer corrupted for some reason. In fact, the pointer
isn't a pointer any more: it's 0x7273752f, which looks like an ASCII
string ("/usr" to be exact). Interesting.

> Apr 24 21:11:56 deanna kernel: general protection: 0000
> Apr 24 21:11:56 deanna kernel: CPU: 0
> Apr 24 21:11:56 deanna kernel: EIP: 0010:[handle_signal+91/120]
> Apr 24 21:11:56 deanna kernel: EFLAGS: 00010202
> Apr 24 21:11:56 deanna kernel: eax: 40061d9c ebx: 01f0e8e4 ecx: 00000000 edx: 00002000
> Apr 24 21:11:56 deanna kernel: esi: 0000000e edi: ffffffff ebp: 00002000 esp: 01f08f68
> Apr 24 21:11:56 deanna kernel: ds: 0018 es: 0018 fs: 002b gs: 002b ss: 0018
> Apr 24 21:11:56 deanna kernel: Process update (pid: 11, process nr: 11, stackpage=01f08000)
> Apr 24 21:11:56 deanna kernel: Stack: 01f0e8e4 01f08fbc 0000000e 00002000 0000000e 01f08fbc 0010a255 0000000e
> Apr 24 21:11:56 deanna kernel: 01f0e8e4 00002000 01f08fbc 00002000 01f08fbc bffffe18 00000005 00109ba7
> Apr 24 21:11:56 deanna kernel: 00002000 01f08fbc 01f0e414 00000000 0010a379 00000000 00000000 00000000
> Apr 24 21:11:56 deanna kernel: Call Trace: [do_signal+529/604] [sys_sigsuspend+59/76] [system_call+89/160]
> Apr 24 21:11:56 deanna kernel: Code: 83 84 10 83 7b 08 00 7d 06 c7 03 00 00 00 00 a1 0c 47 1f 00

But this is _exactly_ the same bug as in crash #1.

Are you sure you don't have a bad SIMM in there somewhere? A wild pointer
that happens to strike twice in exactly the same spot is a rather strange
thing, especially as that pointer points to the text segment rather than
to some data (which is the normal pointer corruption problem: we point to
old data that is used for something else now).

> Apr 24 21:12:32 deanna kernel: Unable to handle kernel NULL pointer dereference at virtual address c000010c
> Apr 24 21:12:32 deanna kernel: current->tss.cr3 = 00101000, ^Lr3 = 00101000
> Apr 24 21:12:32 deanna kernel: *pde = 00102067
> Apr 24 21:12:32 deanna kernel: *pte = 00000027
> Apr 24 21:12:32 deanna kernel: Oops: 0002
> Apr 24 21:12:32 deanna kernel: CPU: 0
> Apr 24 21:12:32 deanna kernel: EIP: 0010:[exit_notify+288/476]
> Apr 24 21:12:32 deanna kernel: EFLAGS: 00010212
> Apr 24 21:12:32 deanna kernel: eax: 00000000 ebx: 0087f414 ecx: 00000010 edx: 01f0e810
> Apr 24 21:12:32 deanna kernel: esi: 00094018 edi: 00000000 ebp: 00000000 esp: 01fcdf9c
> Apr 24 21:12:32 deanna kernel: ds: 0018 es: 0018 fs: 0000 gs: 0000 ss: 0018
> Apr 24 21:12:32 deanna kernel: Process sendmail (pid: 617, process nr: 30, stackpage=01fcd000)
> Apr 24 21:12:32 deanna kernel: Stack: 00279718 00113e6e 01f0e810 ffffffff fffffffc 00113eb2 00000000 0010a379
> Apr 24 21:12:32 deanna kernel: 00000000 00000000 00001590 ffffffff fffffffc 00000000 ffffffda 0808002b
> Apr 24 21:12:32 deanna sendmail[616]: NOQUEUE: SYSERR(root): SMTP-MAIL: died on signal 11
> Apr 24 21:12:32 deanna kernel: 0000002b 0000002b 0000002b 00000001 4003d36e 00100023 00000292 bfffe2dc
> Apr 24 21:12:32 deanna kernel: Call Trace: [do_exit+414/468] [sys_exit+14/16] [system_call+89/160]
> Apr 24 21:12:32 deanna kernel: Code: 89 98 0c 01 00 00 8b 83 04 01 00 00 89 98 08 01 00 00 8b 03

This is

.L1724:
movl 0x104(%ebx),%eax
movl 0x108(%eax),%eax
movl %eax,0x110(%ebx)
** movl %ebx,0x10c(%eax) **
movl 0x104(%ebx),%eax
movl %ebx,0x108(%eax)
movl (%ebx),%eax
cmpl $0x3,%eax

which is the code sequence

p_osptr = p->p_pptr->p_cptr;
** p->p_osptr->p_ysptr = p; **
p->p_pptr->p_cptr = p;
if (p->state == TASK_ZOMBIE)
notify_parent(p);

It looks like "p->p_osptr" (the pointer to the older sibling) is NULL due
to "p->p_pptr->p_pcptr" (the parent pointers child pointer) being NULL.
That shouldn't happen, and is probably due to the earlier panics messing
up the process pointers when the processes have died in critical regions
(or it might be that when init dies bad things happen when we try to give
children to init for reaping ;-)

> Crash #3 (uptime of 50 minutes)
> Apr 25 10:43:31 deanna kernel: Unable to handle kernel paging request at virtual address c80d734b
> Apr 25 10:43:31 deanna kernel: current->tss.cr3 = 0062e000, ^Lr3 = 0062e000
> Apr 25 10:43:31 deanna kernel: *pde = 00000000
> Apr 25 10:43:31 deanna kernel: Oops: 0000
> Apr 25 10:43:31 deanna kernel: CPU: 0
> Apr 25 10:43:31 deanna kernel: EIP: 0010:[handle_signal+91/120]
> Apr 25 10:43:31 deanna kernel: EFLAGS: 00010216
> Apr 25 10:43:31 deanna kernel: eax: 0804f7c8 ebx: 0062f0ec ecx: 00000000 edx: 00000000
> Apr 25 10:43:31 deanna kernel: esi: 0000000e edi: ffffffff ebp: 00000000 esp: 00634f74
> Apr 25 10:43:31 deanna kernel: ds: 0018 es: 0018 fs: 002b gs: 002b ss: 0018
> Apr 25 10:43:31 deanna kernel: Process rpc.nfsd (pid: 80, process nr: 22, stackpage=00634000)
> Apr 25 10:43:31 deanna kernel: Stack: 0062f0ec 00634fbc 0000000e 00000000 0000000e 00634fbc 0010a255 0000000e
> Apr 25 10:43:31 deanna kernel: 0062f0ec 00000000 00634fbc 00000000 bffffdc8 bffffde8 bffffe1c 0010a462
> Apr 25 10:43:31 deanna kernel: 00000000 00634fbc bffffda8 bffffda8 00000004 bffffdc8 bffffde8 bffffe1c
> Apr 25 10:43:31 deanna kernel: Call Trace: [do_signal+529/604] [signal_return+18/64]
> Apr 25 10:43:31 deanna kernel: Code: 83 84 10 83 7b 08 00 7d 06 c7 03 00 00 00 00 a1 0c 47 1f 00

Exactly the same one-bit error in the code again..

Linus