Scheduler grouping failure; division by zero in select_task_rq_fair

From: Ben Hutchings
Date: Sun Nov 28 2010 - 15:15:37 EST


On Sun, 2010-11-28 at 06:00 +0100, Frede Feuerstein wrote:
[...]
> > The division by zero appears to be a result of getting bad information
> > from the firmware about the groups of processors.
>
> Well, technically a division error always is a result of bad data fed to
> that division. I rather meant, that this is the point to backtrace the
> error.
> Though the bios of the w2100z is known for some problems, the cpus are
> reported correctly by the bios and it is the latest version (R01-B5-S1).
>
> > I realise that this
> > same bad information did not previously result in a crash, but I (and
> > the upstream developers) need to know what that information is before we
> > can understand how this can be avoided.
>
> Are there any means to gather more information ? Tell me and i shall do
> it.

I think this is now enough information.

Ingo, Peter, the output from scheduler domain/group setup was:

[ 0.536554] CPU0 attaching sched-domain:
[ 0.540004] domain 0: span 0-1 level MC
[ 0.548002] groups: 0 1
[ 0.560003] domain 1: span 0-3 level NODE
[ 0.568002] groups:
[ 0.574179] ERROR: domain->cpu_power not set
[ 0.576002]
[ 0.580002] ERROR: groups don't span domain->span
[ 0.584004] CPU1 attaching sched-domain:
[ 0.588007] domain 0: span 0-1 level MC
[ 0.596002] groups: 1 0 (cpu_power = 1023)
[ 0.612002] ERROR: parent span is not a superset of domain->span
[ 0.616003] domain 1: span 1-3 level CPU
[ 0.624002] groups: 1 (cpu_power = 2048) 2-3 (cpu_power = 2048)
[ 0.644003] domain 2: span 0-3 level NODE
[ 0.652004] groups: 1-3 (cpu_power = 4096)
[ 0.668002] ERROR: domain->cpu_power not set
[ 0.672002]
[ 0.676002] ERROR: groups don't span domain->span
[ 0.680004] CPU2 attaching sched-domain:
[ 0.684003] domain 0: span 2-3 level MC
[ 0.692003] groups: 2 3
[ 0.704003] domain 1: span 1-3 level CPU
[ 0.712003] groups: 2-3 (cpu_power = 2048) 1 (cpu_power = 2048)
[ 0.736003] domain 2: span 0-3 level NODE
[ 0.744003] groups: 1-3 (cpu_power = 4096)
[ 0.760003] ERROR: domain->cpu_power not set
[ 0.764003]
[ 0.768003] ERROR: groups don't span domain->span
[ 0.772004] CPU3 attaching sched-domain:
[ 0.776003] domain 0: span 2-3 level MC
[ 0.784003] groups: 3 2
[ 0.794183] domain 1: span 1-3 level CPU
[ 0.800003] groups: 2-3 (cpu_power = 2048) 1 (cpu_power = 2048)
[ 0.822183] domain 2: span 0-3 level NODE
[ 0.828003] groups: 1-3 (cpu_power = 4096)
[ 0.842180] ERROR: domain->cpu_power not set
[ 0.844003]
[ 0.848003] ERROR: groups don't span domain->span

and the oops is:

[ 0.852154] divide error: 0000 [#1] SMP
[ 0.856002] last sysfs file:
[ 0.856002] CPU 1
[ 0.856002] Modules linked in:
[ 0.856002] Pid: 2, comm: kthreadd Not tainted 2.6.32-5-amd64 #1 W1100z/2100z
[ 0.856002] RIP: 0010:[<ffffffff810416e9>] [<ffffffff810416e9>] select_task_rq_fair+0x665/0x800
[ 0.856002] RSP: 0018:ffff88003fdb7c90 EFLAGS: 00010046
[ 0.856002] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
[ 0.856002] RDX: 0000000000000000 RSI: 0000000000000200 RDI: 0000000000000200
[ 0.856002] RBP: ffff88004120fd50 R08: 0000000000000000 R09: ffff88007f98f0b0
[ 0.856002] R10: 0000000000000000 R11: 00000000000252d0 R12: ffff88007f98f060
[ 0.856002] R13: ffff88007f98f070 R14: ffffffffffffffff R15: 0000000000015780
[ 0.856002] FS: 0000000000000000(0000) GS:ffff880041200000(0000) knlGS:0000000000000000
[ 0.856002] CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
[ 0.856002] CR2: 0000000000000000 CR3: 0000000001001000 CR4: 00000000000006e0
[ 0.856002] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 0.856002] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[ 0.856002] Process kthreadd (pid: 2, threadinfo ffff88003fdb6000, task ffff88003fdc8710)
[ 0.856002] Stack:
[ 0.856002] 0000000000015780 0000000000015780 0000000000015780 0000000000015780
[ 0.856002] <0> 0000000000015780 0000000000015788 0000000000015788 ffffffff8146c260
[ 0.856002] <0> 0000000800000000 ffff88007f9b0000 ffff880041215780 0000000081317f88
[ 0.856002] Call Trace:
[ 0.856002] [<ffffffff8104d2b2>] ? copy_process+0x1007/0x115f
[ 0.856002] [<ffffffff810475f4>] ? select_task_rq+0xb/0x3e
[ 0.856002] [<ffffffff8104b53b>] ? wake_up_new_task+0x35/0xf6
[ 0.856002] [<ffffffff8104d65e>] ? do_fork+0x254/0x31e
[ 0.856002] [<ffffffff81041aa9>] ? pick_next_task_fair+0xca/0xd6
[ 0.856002] [<ffffffff8104802b>] ? finish_task_switch+0x3a/0xaf
[ 0.856002] [<ffffffff81011b42>] ? kernel_thread+0x82/0xe0
[ 0.856002] [<ffffffff810648c8>] ? kthread+0x0/0x81
[ 0.856002] [<ffffffff81011ba0>] ? child_rip+0x0/0x20
[ 0.856002] [<ffffffff8106488d>] ? kthreadd+0xb1/0xec
[ 0.856002] [<ffffffff814f3140>] ? early_idt_handler+0x0/0x71
[ 0.856002] [<ffffffff81011baa>] ? child_rip+0xa/0x20
[ 0.856002] [<ffffffff814f3140>] ? early_idt_handler+0x0/0x71
[ 0.856002] [<ffffffff810dfda5>] ? do_set_mempolicy+0x128/0x13a
[ 0.856002] [<ffffffff810647dc>] ? kthreadd+0x0/0xec
[ 0.856002] [<ffffffff81011ba0>] ? child_rip+0x0/0x20
[ 0.856002] Code: 00 02 00 00 4c 89 ef 48 63 d2 e8 0f c6 14 00 3b 05 ad 33 49 00 89 c2 0f 8c 6f ff ff ff 41 8b 4c 24 08 48 c1 e3 0a 31 d2 48 89 d8 <48> f7 f1 83 bc 24 a8 00 00 00 00 48 89 c1 75 22 4c 39 f0 73 15
[ 0.856002] RIP [<ffffffff810416e9>] select_task_rq_fair+0x665/0x800
[ 0.856002] RSP <ffff88003fdb7c90>
[ 0.856002] ---[ end trace a22d306b065d4a66 ]---

There's more information in the bug log at <http://bugs.debian.org/603229>.

If you think this has been fixed since 2.6.32 (I didn't see any relevant
changes) then we have a package of 2.6.36 which Frede can test.

Ben.

--
Ben Hutchings, Debian Developer and kernel team member
