Re: v2.6.26-rc9: kernel BUG at kernel/sched.c:5858!

From: Vegard Nossum
Date: Thu Jul 10 2008 - 11:06:52 EST


On Thu, Jul 10, 2008 at 4:16 PM, Vegard Nossum <vegard.nossum@xxxxxxxxx> wrote:
>> Regarding new crashes. Do you get them
>>
>> (1) after a few cpu offline / onlines ?
>> (2) on a freshly booted system?
>> (3) (1) or (2) but only with Miao Xie's patch (should not be (2) then)
>> (4) something else?
>
> Without Miao Xie's patch, I regularly get a crash on the first cpu-up.
> So I am using it all the time. With this patch applied, the new
> crashes can happen from anywhere between 2 minutes to 20 while running
> a few different looping scripts simultaneously:
>
> 1. cpu up/down
> 2. grep -r . /sys
> 3. swapon/swapoff
> 4. cat /dev/cpu/*/msr

Disabling script #1 (cpu up/down) kept the machine alive for at least
25 minutes. When I then started it again, the machine hung after 492
rounds of cpu up/down, with this new report:

list_add corruption. next->prev should be prev (f782d090), but was
00000000. (next=f20b8438).
------------[ cut here ]------------
kernel BUG at lib/list_debug.c:27!
invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
Pid: 3860, comm: bash Not tainted (2.6.26-rc9-00059-gb190333 #5)
EIP: 0060:[<c0294b80>] EFLAGS: 00210086 CPU: 0
EIP is at __list_add+0x40/0x60
EAX: 00000061 EBX: f782d090 ECX: 00000002 EDX: 00000002
ESI: 00200282 EDI: c0a8de8c EBP: e7dd3e84 ESP: e7dd3e6c
DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068
Process bash (pid: 3860, ti=e7dd2000 task=e7e0afd0 task.ti=e7dd2000)
Stack: c067fd30 f782d090 00000000 f20b8438 f20b81b0 00200282 e7dd3e8c c0294baa
e7dd3e98 c019bf5e f782d070 e7dd3ec8 c019c4b1 c0a8deac 000000d0 e7de2a00
c1da1ddc 00000005 00000000 f20b81b0 fff95000 fff98000 fff96000 e7dd3ed4
Call Trace:
[<c0294baa>] ? list_add+0xa/0x10
[<c019bf5e>] ? __mem_cgroup_add_list+0x3e/0x40
[<c019c4b1>] ? mem_cgroup_charge_common+0x231/0x260
[<c019c522>] ? mem_cgroup_charge+0x12/0x20
[<c01843e7>] ? do_wp_page+0x117/0x550
[<c01864c1>] ? handle_mm_fault+0x1b1/0x770
[<c01866f1>] ? handle_mm_fault+0x3e1/0x770
[<c014cc95>] ? down_read_trylock+0x55/0x60
[<c0120d98>] ? do_page_fault+0x298/0x700
[<c0584b26>] ? _spin_unlock_irq+0x36/0x60
[<c01402db>] ? sigprocmask+0x7b/0xf0
[<c0104df5>] ? restore_nocheck+0x12/0x15
[<c0120b00>] ? do_page_fault+0x0/0x700
[<c0584e6a>] ? error_code+0x72/0x78
=======================
Code: 75 2d 89 08 89 41 04 89 02 89 50 04 83 c4 10 5b 5e 5d c3 89 4c
24 0c 89 54 24 08 89 5c 24 04 c7 04 24 30 fd 67 c0 e8 80 0c ea ff
<0f> 0b eb fe 89 5c 24 0c 89 74 24 08 89 4c 24 04 c7 04 24 80 fd
EIP: [<c0294b80>] __list_add+0x40/0x60 SS:ESP 0068:e7dd3e6c
---[ end trace 89a65901b268513f ]---

This list corruption has a completely different backtrace from the
earlier one, but in both cases the offending value was 0 instead of the
expected one. This fits the theory that something is being zeroed that
shouldn't be.
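
For anyone who wants to see how a zeroed list node produces exactly this
kind of message, here is a minimal userspace sketch. It is not the kernel
source: checked_list_add() and demo_zeroed_node() are names made up for
illustration, and the check only imitates the kind of neighbour
consistency test that a debug list_add performs (next->prev must equal
prev before the new entry is linked in).

```c
#include <stdio.h>
#include <string.h>

/* Userspace stand-in for a doubly linked list node. */
struct list_head {
    struct list_head *next, *prev;
};

/*
 * Hypothetical checked insert: link 'new' between 'prev' and 'next',
 * but refuse if the two neighbours do not point at each other.
 * Returns 0 on success, -1 if corruption is detected.
 */
static int checked_list_add(struct list_head *new,
                            struct list_head *prev,
                            struct list_head *next)
{
    if (next->prev != prev) {
        fprintf(stderr,
                "list_add corruption. next->prev should be prev (%p), "
                "but was %p. (next=%p).\n",
                (void *)prev, (void *)next->prev, (void *)next);
        return -1;
    }
    if (prev->next != next) {
        fprintf(stderr,
                "list_add corruption. prev->next should be next (%p), "
                "but was %p. (prev=%p).\n",
                (void *)next, (void *)prev->next, (void *)prev);
        return -1;
    }
    next->prev = new;
    new->next = next;
    new->prev = prev;
    prev->next = new;
    return 0;
}

/*
 * Simulate the suspected failure: a node already on the list gets
 * zeroed behind the list's back, then another node is added.
 * Returns the result of the second insert.
 */
static int demo_zeroed_node(void)
{
    struct list_head head = { &head, &head };
    struct list_head a, b;

    if (checked_list_add(&a, &head, head.next) != 0)
        return 1;                  /* first insert should succeed */

    memset(&a, 0, sizeof(a));      /* stray zeroing of a live node */

    /* head.next still points at 'a', but a.prev is now NULL. */
    return checked_list_add(&b, &head, head.next);
}
```

Calling demo_zeroed_node() makes the second insert fail with a
"next->prev should be prev (...), but was (nil)" style message (the exact
NULL formatting depends on the C library), which has the same shape as
the report above: the list head is intact, and the first entry's prev
pointer reads as zero.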


Vegard

--
"The animistic metaphor of the bug that maliciously sneaked in while
the programmer was not looking is intellectually dishonest as it
disguises that the error is the programmer's own creation."
-- E. W. Dijkstra, EWD1036