Re: Crashes with 874bbfe600a6 in 3.18.25

From: Thomas Gleixner
Date: Wed Feb 03 2016 - 05:42:51 EST


On Wed, 3 Feb 2016, Jiri Slaby wrote:
> On 01/26/2016, 02:09 PM, Thomas Gleixner wrote:
> What happens in later kernels, when the cpu is offlined before the
> delayed_work timer ticks? In stable 3.12, with the patch, this scenario
> results in an oops:
> #5 [ffff8c03fdd63d80] page_fault at ffffffff81523a88
> [exception RIP: __queue_work+121]
> RIP: ffffffff81071989 RSP: ffff8c03fdd63e30 RFLAGS: 00010086
> RAX: ffff88048b96bc00 RBX: ffff8c03e9bcc800 RCX: ffff880473820478
> RDX: 0000000000000400 RSI: 0000000000000004 RDI: ffff880473820458
> RBP: 0000000000000000 R8: ffff8c03fdd71f40 R9: ffff8c03ea4c4002
> R10: 0000000000000000 R11: 0000000000000005 R12: ffff880473820458
> R13: 00000000000000a8 R14: 000000000000e328 R15: 00000000000000a8
> ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
> #6 [ffff8c03fdd63e68] call_timer_fn at ffffffff81065611
> #7 [ffff8c03fdd63e98] run_timer_softirq at ffffffff810663b7
> #8 [ffff8c03fdd63f00] __do_softirq at ffffffff8105e2c5
> #9 [ffff8c03fdd63f68] call_softirq at ffffffff8152cf9c
> #10 [ffff8c03fdd63f80] do_softirq at ffffffff81004665
> #11 [ffff8c03fdd63fa0] smp_apic_timer_interrupt at ffffffff8152d835
> #12 [ffff8c03fdd63fb0] apic_timer_interrupt at ffffffff8152c2dd
>
> The CPU was 168, and that one was offlined in the meantime. So
> __queue_work fails at:
> 	if (!(wq->flags & WQ_UNBOUND))
> 		pwq = per_cpu_ptr(wq->cpu_pwqs, cpu);
> 	else
> 		pwq = unbound_pwq_by_node(wq, cpu_to_node(cpu));
> 		^^^                           ^^^^ NODE is -1
> 		  \ pwq is NULL
>
> 	if (last_pool && last_pool != pwq->pool) { <--- BOOM

I don't see how that works on later kernels. If cpu_to_node() returns -1 we
access outside of the array bounds....
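
To make the concern concrete, here is a minimal userspace sketch, not kernel code. It assumes a simplified model of a per-node lookup table: the types `pwq_model` and `wq_model`, the constant `MAX_NODES` and the helper `pwq_by_node_checked()` are all hypothetical names made up for illustration. It shows one possible way to keep a node value of -1 from ever being used as an array index, by falling back to a default entry instead.

```c
#include <stddef.h>

/* The kernel defines NUMA_NO_NODE as -1; everything else here is a toy model. */
#define NUMA_NO_NODE (-1)
#define MAX_NODES 4 /* hypothetical fixed table size for this sketch */

struct pwq_model {
	int id;
};

struct wq_model {
	struct pwq_model *dfl_pwq;                 /* fallback entry */
	struct pwq_model *numa_pwq_tbl[MAX_NODES]; /* indexed by node id */
};

/*
 * Hypothetical guarded lookup: an unchecked tbl[node] with node == -1
 * reads the slot in front of the array, which is undefined behaviour
 * in C. Here any out-of-range node yields the fallback entry instead.
 */
struct pwq_model *pwq_by_node_checked(const struct wq_model *wq, int node)
{
	if (node < 0 || node >= MAX_NODES)
		return wq->dfl_pwq;

	return wq->numa_pwq_tbl[node];
}
```

With such a guard, a lookup for an offlined CPU whose node resolves to NUMA_NO_NODE would return the fallback entry rather than whatever happens to sit in memory before the table. Whether a fallback is the right behaviour for the real workqueue code is a separate question.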

Thanks,

tglx