Re: WARNING: at kernel/rcutree.c:1558 rcu_do_batch+0x386/0x3a0(),during CPU hotplug

From: Srivatsa S. Bhat
Date: Fri Sep 14 2012 - 08:18:44 EST


On 09/14/2012 05:17 PM, Fengguang Wu wrote:
> On Thu, Sep 13, 2012 at 02:05:17PM +0530, Srivatsa S. Bhat wrote:
>> On 09/12/2012 06:06 PM, Srivatsa S. Bhat wrote:
>>> On 07/19/2012 10:45 PM, Paul E. McKenney wrote:
>>>> On Thu, Jul 19, 2012 at 05:39:30PM +0530, Srivatsa S. Bhat wrote:
>>>>> Hi Paul,
>>>>>
>>>>> While running a CPU hotplug stress test on v3.5-rc7+
>>>>> (mainline commit 8a7298b7805ab) I hit this warning.
>>>>> I haven't tried to debug this yet...
>>>>>
>>>>> Line number 1550 maps to:
>>>>>
>>>>> WARN_ON_ONCE(cpu_is_offline(smp_processor_id()));
>>>>>
>>>>> inside rcu_do_batch().
>>>>
>>>> Hello, Srivatsa,
>>>>
>>>> I believe that you need commit a16b7a69 (Prevent __call_rcu() from
>>>> invoking RCU core on offline CPUs), which is currently in -tip, queued
>>>> for 3.6. Please see below for the patch.
>>>>
>>>> Does this help?
>>>>
>>>
>>> Hi Paul,
>>>
>>> I am hitting the cpu_is_offline() warning in rcu_do_batch() (see 2 of the
>>> examples below) occasionally while testing CPU hotplug on Thomas' smp/hotplug
>>> branch in -tip. It does contain the commit that you had mentioned above.
>>>
>>
>> I also hit some writeback related problems during some of these runs. But I was
>> not able to reproduce them after that occurrence. (Adding relevant people to CC.)
>>
>> I hit the divide error shown below during the CPU hotplug test run, and the general
>
> I've hit the divide error before. And I'm not luckier than you in
> reproducing the bug. I tempt to add a test as the workaround:
>

Hi Fengguang,
First of all, thanks for taking a look!

I just hit it today during my testing again! See below for the stack-traces..

> + if (WARN_ON(!denominator))
> + return dirty;

Yep, I am going to try this out next and see if the denominator really became 0.
I went through the code and I didn't find anything wrong with it. It looks pretty
good (the denominator sanity checking part). Hmm...

>
> btw, any chance you may share the CPU hotplug test scripts?
> It'd be a valuable addition to my 0day boot test system.
>

Sure, I was using a very simple script that does offline/online in sequence. That's all.
Please find it below. Note that adding a breather between the hotplug operations (like
adding sleep as shown below) has the potential to expose different bugs than the normal
stress test (no sleeps between subsequent operations). So both cases are useful to test.
For example, with the sleep, I hit this divide error with higher probability, whereas
without the sleep (stress test) I hit the rcu_do_batch warning more often.

(And on a slightly different note, when x86 supports CPU 0 hotplug (today it doesn't),
you can use i = 0 instead of i = 1 in the loop. I believe that code is already in -tip).


CPU hotplug test script
-----------------------
#! /bin/bash

NUMBER_OF_CPUS=`ls -d /sys/devices/system/cpu/cpu[0-9]* | wc -l`

cd /sys/devices/system/cpu

while [ 1 ]
do
for ((i=1; i < NUMBER_OF_CPUS; i++))
do
sleep 1;
state=`cat cpu$i/online`
if [ $state -eq 0 ]
then
echo 1 > cpu$i/online
else
echo 0 > cpu$i/online
fi
done
done


>> protection fault subsequently, while trying to shutdown the machine after the test.
>

Dumps from the divide error and general protection fault in today's testing:


[ 4831.353623] SMP alternatives: lockdep: fixing up alternatives
[ 4831.359465] smpboot: Booting Node 1 Processor 13 APIC 0x13
[ 4833.028342] SMP alternatives: lockdep: fixing up alternatives
[ 4833.034188] smpboot: Booting Node 1 Processor 14 APIC 0x15
[ 4835.693836] SMP alternatives: lockdep: fixing up alternatives
[ 4835.699609] smpboot: Booting Node 1 Processor 15 APIC 0x17
[ 4843.142309] Broke affinity for irq 77
[ 4843.147605] smpboot: CPU 1 is now offline
[ 4843.155885] CPU 9 MCA banks CMCI:2 CMCI:3 CMCI:5
[ 4844.379459] Broke affinity for irq 80
[ 4844.384569] smpboot: CPU 2 is now offline
[ 4844.392111] CPU 10 MCA banks CMCI:2 CMCI:3 CMCI:5
[ 4845.810713] smpboot: CPU 3 is now offline
[ 4845.821409] CPU 11 MCA banks CMCI:2 CMCI:3 CMCI:5
[ 4848.710692] Broke affinity for irq 74
[ 4848.716063] smpboot: CPU 4 is now offline
[ 4848.721781] CPU 5 MCA banks CMCI:6 CMCI:8
[ 4848.726440] CPU 12 MCA banks CMCI:2 CMCI:3 CMCI:5
[ 4855.990349] smpboot: CPU 5 is now offline
[ 4855.997251] CPU 6 MCA banks CMCI:6 CMCI:8
[ 4856.002002] CPU 13 MCA banks CMCI:2 CMCI:3 CMCI:5
[ 4859.198420] smpboot: CPU 6 is now offline
[ 4859.205217] CPU 7 MCA banks CMCI:6 CMCI:8
[ 4859.209858] CPU 14 MCA banks CMCI:2 CMCI:3 CMCI:5
[ 4861.202261] smpboot: CPU 7 is now offline
[ 4861.211009] CPU 12 MCA banks CMCI:6 CMCI:8
[ 4861.215403] CPU 15 MCA banks CMCI:2 CMCI:3 CMCI:5
[ 4863.210557] smpboot: CPU 8 is now offline
[ 4890.536087] divide error: 0000 [#1] SMP
[ 4890.540047] Modules linked in: ipv6 cpufreq_conservative cpufreq_userspace cpufreq_powersave acpi_cpufreq mperf fuse loop dm_mod iTCO_wdt iTCO_vendor_support coretemp kvm_intel kvm crc32c_intel microcode cdc_ether usbnet i2c_i801 mii tpm_tis pcspkr ioatdma i7core_edac tpm serio_raw i2c_core bnx2 shpchp lpc_ich mfd_core edac_core pci_hotplug dca tpm_bios sg rtc_cmos button uhci_hcd ehci_hcd usbcore usb_common sd_mod crc_t10dif edd ext3 mbcache jbd fan processor mptsas mptscsih mptbase scsi_transport_sas scsi_mod thermal thermal_sys hwmon
[ 4890.540047] CPU 9
[ 4890.540047] Pid: 2243, comm: flush-8:0 Not tainted 3.6.0-rc1-tglx-hotplug-0.0.0.28.36b5ec9-default #1 IBM IBM System x -[7870C4Q]-/68Y8033
[ 4890.540047] RIP: 0010:[<ffffffff811276f6>] [<ffffffff811276f6>] bdi_dirty_limit+0x66/0xc0
[ 4890.540047] RSP: 0018:ffff8808d3e1fcc0 EFLAGS: 00010202
[ 4890.540047] RAX: 0000000000000000 RBX: 00000000001a80c4 RCX: 28f5c28f5c28f5c3
[ 4890.540047] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
[ 4890.540047] RBP: ffff8808d3e1fce0 R08: 0000000000000010 R09: 0000000000000000
[ 4890.540047] R10: 0000000000000000 R11: 0000000000000000 R12: ffff8808d3d80e20
[ 4890.540047] R13: ffff8808d3d80e20 R14: ffff8808d3d811a0 R15: 0000000000000000
[ 4890.540047] FS: 0000000000000000(0000) GS:ffff8808ddd40000(0000) knlGS:0000000000000000
[ 4890.540047] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[ 4890.540047] CR2: ffffffffff600400 CR3: 0000000001a0c000 CR4: 00000000000007e0
[ 4890.540047] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 4890.540047] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[ 4890.540047] Process flush-8:0 (pid: 2243, threadinfo ffff8808d3e1e000, task ffff8808d4060000)
[ 4890.540047] Stack:
[ 4890.540047] 0001000000000000 0000000000000000 ffff8808d3d80e20 fffffffffffffff7
[ 4890.540047] ffff8808d3e1fd10 ffffffff811ae95c 0000000000350189 00000000001a80c4
[ 4890.540047] 0000000000000000 0000000000000000 ffff8808d3e1fdc0 ffffffff811b0620
[ 4890.540047] Call Trace:
[ 4890.540047] [<ffffffff811ae95c>] over_bground_thresh+0x7c/0x90
[ 4890.540047] [<ffffffff811b0620>] wb_do_writeback+0x170/0x310
[ 4890.540047] [<ffffffff811b08eb>] bdi_writeback_thread+0x12b/0x420
[ 4890.540047] [<ffffffff811b07c0>] ? wb_do_writeback+0x310/0x310
[ 4890.540047] [<ffffffff8106deae>] kthread+0xde/0xf0
[ 4890.540047] [<ffffffff814c6184>] kernel_thread_helper+0x4/0x10
[ 4890.540047] [<ffffffff814bc1f0>] ? retint_restore_args+0x13/0x13
[ 4890.540047] [<ffffffff8106ddd0>] ? __init_kthread_worker+0x70/0x70
[ 4890.540047] [<ffffffff814c6180>] ? gs_change+0x13/0x13
[ 4890.540047] Code: 28 5c 8f c2 f5 28 8b 7d e0 48 89 c6 48 0f af f3 48 c1 ee 02 48 89 f0 48 f7 e1 48 89 d6 31 d2 48 c1 ee 02 48 0f af 75 e8 48 89 f0 <48> f7 f7 41 8b 94 24 74 02 00 00 48 0f af d3 48 89 c7 48 c1 ea
[ 4890.540047] RIP [<ffffffff811276f6>] bdi_dirty_limit+0x66/0xc0
[ 4890.540047] RSP <ffff8808d3e1fcc0>
[ 4890.805627] ---[ end trace 3bdb61d99b220aeb ]---
[ 4890.818925] Broke affinity for irq 76
[ 4890.823921] smpboot: CPU 9 is now offline
[ 4900.238238] general protection fault: 0000 [#2] SMP
[ 4900.242203] Modules linked in: ipv6 cpufreq_conservative cpufreq_userspace cpufreq_powersave acpi_cpufreq mperf fuse loop dm_mod iTCO_wdt iTCO_vendor_support coretemp kvm_intel kvm crc32c_intel microcode cdc_ether usbnet i2c_i801 mii tpm_tis pcspkr ioatdma i7core_edac tpm serio_raw i2c_core bnx2 shpchp lpc_ich mfd_core edac_core pci_hotplug dca tpm_bios sg rtc_cmos button uhci_hcd ehci_hcd usbcore usb_common sd_mod crc_t10dif edd ext3 mbcache jbd fan processor mptsas mptscsih mptbase scsi_transport_sas scsi_mod thermal thermal_sys hwmon
[ 4900.248087] CPU 13
[ 4900.248087] Pid: 0, comm: swapper/13 Tainted: G D 3.6.0-rc1-tglx-hotplug-0.0.0.28.36b5ec9-default #1 IBM IBM System x -[7870C4Q]-/68Y8033
[ 4900.248087] RIP: 0010:[<ffffffff810843d7>] [<ffffffff810843d7>] try_to_wake_up+0x57/0x2f0
[ 4900.248087] RSP: 0018:ffff88117fd43d30 EFLAGS: 00010002
[ 4900.248087] RAX: 6b6b6b6b6b6b6b6b RBX: 000000000000000f RCX: 000000006b6b6b6b
[ 4900.248087] RDX: 000000006b6c6b6b RSI: ffffffff817a7fbf RDI: ffff8808d40607f0
[ 4900.248087] RBP: ffff88117fd43d70 R08: 0000000000000002 R09: 0000000000000001
[ 4900.248087] R10: 0000000000000000 R11: 0000000000000001 R12: ffff8808d4060000
[ 4900.248087] R13: 000000006b6b6b6b R14: ffff8808d3d80e20 R15: 0000000000000000
[ 4900.248087] FS: 0000000000000000(0000) GS:ffff88117fd40000(0000) knlGS:0000000000000000
[ 4900.248087] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[ 4900.248087] CR2: 00007f472ef5f068 CR3: 0000000001a0c000 CR4: 00000000000007e0
[ 4900.248087] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 4900.248087] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[ 4900.248087] Process swapper/13 (pid: 0, threadinfo ffff8808d762e000, task ffff8808d7630000)
[ 4900.248087] Stack:
[ 4900.248087] 0000000000000246 ffff8808d40607f0 0000000000000286 ffff8808d762ffd8
[ 4900.248087] ffff8808d3d80e20 ffff8808d3d811a0 ffff8808d3d80e20 ffffffff8113d970
[ 4900.248087] ffff88117fd43d80 ffffffff810846a0 ffff88117fd43db0 ffffffff8113da22
[ 4900.248087] Call Trace:
[ 4900.248087] <IRQ>
[ 4900.248087] [<ffffffff8113d970>] ? bdi_init+0x270/0x270
[ 4900.248087] [<ffffffff810846a0>] wake_up_process+0x10/0x20
[ 4900.248087] [<ffffffff8113da22>] wakeup_timer_fn+0xb2/0x1c0
[ 4900.248087] [<ffffffff810557db>] call_timer_fn+0xcb/0x240
[ 4900.248087] [<ffffffff81055710>] ? detach_if_pending+0x150/0x150
[ 4900.248087] [<ffffffff8113d970>] ? bdi_init+0x270/0x270
[ 4900.248087] [<ffffffff81056f75>] run_timer_softirq+0x1c5/0x2c0
[ 4900.248087] [<ffffffff8104dc79>] __do_softirq+0x159/0x400
[ 4900.248087] [<ffffffff814c627c>] call_softirq+0x1c/0x30
[ 4900.248087] [<ffffffff810044f5>] do_softirq+0x95/0xd0
[ 4900.248087] [<ffffffff8104d745>] irq_exit+0xe5/0x100
[ 4900.248087] [<ffffffff81028df9>] smp_apic_timer_interrupt+0x69/0xa0
[ 4900.248087] [<ffffffff814c5aef>] apic_timer_interrupt+0x6f/0x80
[ 4900.248087] <EOI>
[ 4900.248087] [<ffffffff812db3d3>] ? intel_idle+0xc3/0x170
[ 4900.248087] [<ffffffff812db3d7>] ? intel_idle+0xc7/0x170
[ 4900.248087] [<ffffffff812db3d3>] ? intel_idle+0xc3/0x170
[ 4900.248087] [<ffffffff813b1937>] cpuidle_enter+0x17/0x20
[ 4900.248087] [<ffffffff813b1952>] cpuidle_enter_state+0x12/0x50
[ 4900.248087] [<ffffffff813b2403>] cpuidle_idle_call+0x133/0x260
[ 4900.248087] [<ffffffff8100b1f5>] cpu_idle+0x95/0xf0
[ 4900.248087] [<ffffffff814b1819>] start_secondary+0x85/0x8c
[ 4900.248087] Code: 31 ed 48 89 c7 48 89 45 c8 e8 46 72 43 00 48 89 45 d0 49 8b 04 24 85 c3 0f 84 02 02 00 00 45 8b 6c 24 2c 49 8b 44 24 08 45 85 ed <44> 8b 70 18 74 75 8b 1d dd ea 9d 00 85 db 0f 85 2d 02 00 00 49
[ 4900.248087] RIP [<ffffffff810843d7>] try_to_wake_up+0x57/0x2f0
[ 4900.248087] RSP <ffff88117fd43d30>
[ 4900.248087] ---[ end trace 3bdb61d99b220aec ]---
[ 4900.248087] Kernel panic - not syncing: Fatal exception in interrupt



--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/