Re: percpu related boot crash on x86 (was: Linux 2.6.38-rc1)

From: Tejun Heo
Date: Wed Jan 19 2011 - 07:44:32 EST


Hello, Ingo.

On Wed, Jan 19, 2011 at 01:02:00PM +0100, Ingo Molnar wrote:
>
> There's a rather frequent, percpu related boot crash that I can see with .38-rc1:
> [ 0.000000] NR_IRQS:4352
> [ 0.000000] ------------[ cut here ]------------
> [ 0.000000] WARNING: at kernel/smp.c:433 smp_call_function_many+0x90/0x209()
...
> [ 0.000000] [<ffffffff81076299>] ? on_each_cpu+0x1b/0x39
> [ 0.000000] [<ffffffff810274e6>] ? flush_tlb_all+0x1c/0x1e
> [ 0.000000] [<ffffffff810dc7d7>] ? remove_vm_area+0x71/0x96
> [ 0.000000] [<ffffffff810dc868>] ? __vunmap+0x3f/0xcf
> [ 0.000000] [<ffffffff810dc9db>] ? vfree+0x2c/0x2e
> [ 0.000000] [<ffffffff810ccca6>] ? pcpu_mem_free+0x1e/0x20
> [ 0.000000] [<ffffffff810ccd75>] ? pcpu_extend_area_map+0x9a/0xb6
> [ 0.000000] [<ffffffff810cd452>] ? pcpu_alloc+0x17e/0x916
> [ 0.000000] [<ffffffff8106bb00>] ? trace_hardirqs_off+0xd/0xf
> [ 0.000000] [<ffffffff810e5bed>] ? kmem_cache_alloc_trace+0xab/0x120
> [ 0.000000] [<ffffffff810cdbfa>] ? __alloc_percpu+0x10/0x12
> [ 0.000000] [<ffffffff8180afd4>] ? early_irq_init+0xb2/0x13d
...

This is the vfree() path being used before local irqs are enabled
during early boot. vfree() triggered a TLB flush (maybe debug
enabled?), which used on_each_cpu(), which isn't quite happy to be
called with local irqs disabled.
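
To make the sequence concrete, here is a rough userspace model of it.
This is only a sketch: every model_* name is made up for illustration
and none of it is actual kernel code. It just shows a free path that
ends in a cross-CPU call complaining when local irqs are still off.

```c
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical userspace model; none of these names are kernel APIs. */
static bool model_irqs_enabled = false; /* early boot: local irqs still off */
static int model_warn_count = 0;        /* how many times the check fired */

/* Stand-in for a cross-CPU call that expects local irqs to be enabled. */
static void model_cross_cpu_call(void)
{
    if (!model_irqs_enabled) {
        model_warn_count++;
        fprintf(stderr, "WARNING: cross-CPU call with local irqs disabled\n");
    }
}

/* Stand-in for a free path whose teardown flushes TLBs on every CPU. */
static void model_vfree(void *area)
{
    (void)area; /* the model frees nothing; only the call chain matters */
    model_cross_cpu_call();
}
```

Calling model_vfree() while model_irqs_enabled is false bumps
model_warn_count; calling it again after setting the flag to true
does not.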

> [ 0.000000] general protection fault: 01bb [#1] SMP DEBUG_PAGEALLOC
...
> [ 0.000000] Call Trace:
> [ 0.000000] [<ffffffff810068a4>] init_8259A+0xe3/0xe8
> [ 0.000000] [<ffffffff817f7d71>] init_ISA_irqs+0x2f/0x5a
> [ 0.000000] [<ffffffff817f7de1>] native_init_IRQ+0xe/0xa2
> [ 0.000000] [<ffffffff817f7dd1>] init_IRQ+0x35/0x37
> [ 0.000000] [<ffffffff817f4a0b>] start_kernel+0x1ff/0x3a4
> [ 0.000000] [<ffffffff817f42a6>] x86_64_start_reservations+0xb6/0xba
> [ 0.000000] [<ffffffff817f43a1>] x86_64_start_kernel+0xf7/0xfe
> [ 0.000000] Code: 18 48 89 f3 be 01 00 00 00 e8 33 fe cd ff 4c 89 e7 e8 77 1f e2 ff f6 c7 02 75 09 53 9d e8 a0 bf cd ff eb 07 e8 74 08 ce ff 53 9d <5b> 41 5c c9 c3 55 48 89 e5 53 48 83 ec 08 e8 91 2c c7 ff 48 8b
> [ 0.000000] RIP [<ffffffff8138fb5c>] _raw_spin_unlock_irqrestore+0x41/0x4

and this looks like alloc_percpu() failed earlier during early irq
init. The irq init functions don't check for a NULL return, so the
failure only blows up later. I'll see if I can reproduce the problem
here.
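
For illustration, the kind of check that is missing looks roughly
like the following. It is a userspace sketch under stated
assumptions: struct model_desc, model_alloc() and model_desc_init()
are hypothetical stand-ins, not the real irq descriptor code or the
percpu allocator.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical descriptor that owns one allocated counter array. */
struct model_desc {
    int *counters;
};

/* Stand-in allocator; should_fail forces the early-boot failure case. */
static void *model_alloc(size_t nr, int should_fail)
{
    if (should_fail)
        return NULL;
    return calloc(nr, sizeof(int));
}

/* Returns 0 on success, -1 if the allocation failed. */
static int model_desc_init(struct model_desc *desc, size_t nr, int should_fail)
{
    desc->counters = model_alloc(nr, should_fail);
    if (!desc->counters)
        return -1; /* report the failure here instead of crashing later */
    return 0;
}
```

With the check in place the caller sees the failure at init time
rather than as an unrelated-looking fault further down the boot path.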

It doesn't look like anything hardware dependent. The first warning
seems more or less spurious and the GPF seems to be caused by the
earlier memory allocation failure. It's a bit curious that the
allocation failed on an x86_64 machine, though.

Thanks.

--
tejun