Re: BUG_ON in rcu_sync_func triggered

From: Nikolay Borisov
Date: Wed Sep 14 2016 - 02:30:34 EST




On 09/13/2016 06:20 PM, Oleg Nesterov wrote:
> On 09/13, Nikolay Borisov wrote:
>>
>> On 09/13/2016 05:35 PM, Nikolay Borisov wrote:
>>>
>>> On 09/13/2016 04:43 PM, Oleg Nesterov wrote:
>>>> On 09/13, Oleg Nesterov wrote:
>>>>>
>>>>> OK... perhaps the unbalanced up_write... I'll try to look at freeze/thaw code,
>>>>
>>>> Heh, yes, it looks racy or I am totally confused.
>>>>
>>>>> could test the debugging patch below meanwhile?
>>>>
>>>> Yes please. I'll send you another patch (hopefully fix) later, but it
>>>> would be nice if you can test this patch to get more info.
>>>
>>> I've already started testing with this patch on 4.4.20 this time
>
> I think it would be better to stay with the same kernel version to
> debug the problem...
>
>> Actually forget that, here is a warning that this triggered:
>>
>> [ 844.290454] WARNING: CPU: 2 PID: 1900 at kernel/rcu/sync.c:160 rcu_sync_func+0xc8/0x150()
>> ...
>> [ 844.754708] XXX: ffff88047527da78 gp=2 cnt=0 cb=1
>
> Hmm. Thanks. Please show us all the warnings you get.

What I showed you yesterday was the only warning that I got. On the same kernel another run produced the following warnings:

[ 556.869624] ------------[ cut here ]------------
[ 556.875128] WARNING: CPU: 6 PID: 20210 at kernel/rcu/sync.c:111 rcu_sync_enter+0x148/0x1a0()
[ 556.885142] Modules linked in: xt_state act_police cls_basic sch_ingress veth rbd libceph openvswitch nf_defrag_ipv6 nf_nat_ftp nf_conntrack_ftp xt_owner iptable_mangle xt_nat iptable_nat nf_nat_ipv4 nf_nat xt_CT iptable_raw nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack ip6table_filter ip6_tables rdma_ucm ib_ucm ib_uverbs rdma_cm iw_cm dm_mirror dm_region_hash dm_log ib_umad ib_ipoib ib_cm ib_sa ib_mad ib_core ib_addr ipv6 x86_pkg_temp_thermal kvm_intel kvm irqbypass crc32_pclmul ixgbe mdio ipmi_devintf ipmi_si ipmi_msghandler igb i2c_algo_bit sb_edac edac_core i2c_i801 lpc_ich mfd_core ioatdma dca shpchp dm_thin_pool dm_bio_prison dm_persistent_data dm_bufio
[ 556.960028] CPU: 6 PID: 20210 Comm: fsfreeze Not tainted 4.4.20-clouder1 #9
[ 556.968143] Hardware name: Supermicro X9DRW/X9DRW, BIOS 1.0b 10/11/2012
[ 556.975963] 0000000000000000 ffff880256b43cf8 ffffffff81307a9b 0000000000004ef2
[ 556.985182] 0000000000000000 0000000000000000 000000000000006f ffff880256b43d38
[ 556.994383] ffffffff81054a85 ffff880473326b10 ffff880473326b08 ffff880473326b10
[ 557.003589] Call Trace:
[ 557.006656] [<ffffffff81307a9b>] dump_stack+0x6b/0xa0
[ 557.012737] [<ffffffff81054a85>] warn_slowpath_common+0x95/0xe0
[ 557.019781] [<ffffffff81054aea>] warn_slowpath_null+0x1a/0x20
[ 557.026645] [<ffffffff810ab9a8>] rcu_sync_enter+0x148/0x1a0
[ 557.033309] [<ffffffff8109c9be>] percpu_down_write+0x1e/0xf0
[ 557.040074] [<ffffffff81315683>] ? call_rwsem_down_write_failed+0x13/0x20
[ 557.048092] [<ffffffff811a868b>] freeze_super+0xab/0x1b0
[ 557.054456] [<ffffffff811b7c0d>] do_vfs_ioctl+0x29d/0x560
[ 557.060920] [<ffffffff811aae7e>] ? SYSC_newfstat+0x2e/0x40
[ 557.067480] [<ffffffff811b7f62>] SyS_ioctl+0x92/0xa0
[ 557.073465] [<ffffffff8163c357>] entry_SYSCALL_64_fastpath+0x12/0x6a
[ 557.081015] ---[ end trace fc087420ac1d8f16 ]---
[ 557.086507] XXX: ffff880473326b08 gp=2 cnt=-1 cb=1
[ 557.092326] rbd: rbd19: added with size 0x500000000

This is: if (WARN_ON(rsp->gp_count < 0)) xxx(rsp);



[ 557.094835] ------------[ cut here ]------------
[ 557.094841] WARNING: CPU: 1 PID: 0 at kernel/rcu/sync.c:159 rcu_sync_func+0xf9/0x150()
[ 557.094873] Modules linked in: xt_state act_police cls_basic sch_ingress veth rbd libceph openvswitch nf_defrag_ipv6 nf_nat_ftp nf_conntrack_ftp xt_owner iptable_mangle xt_nat iptable_nat nf_nat_ipv4 nf_nat xt_CT iptable_raw nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack ip6table_filter ip6_tables rdma_ucm ib_ucm ib_uverbs rdma_cm iw_cm dm_mirror dm_region_hash dm_log ib_umad ib_ipoib ib_cm ib_sa ib_mad ib_core ib_addr ipv6 x86_pkg_temp_thermal kvm_intel kvm irqbypass crc32_pclmul ixgbe mdio ipmi_devintf ipmi_si ipmi_msghandler igb i2c_algo_bit sb_edac edac_core i2c_i801 lpc_ich mfd_core ioatdma dca shpchp dm_thin_pool dm_bio_prison dm_persistent_data dm_bufio
[ 557.094876] CPU: 1 PID: 0 Comm: swapper/1 Tainted: G W 4.4.20-clouder1 #9
[ 557.094877] Hardware name: Supermicro X9DRW/X9DRW, BIOS 1.0b 10/11/2012
[ 557.094879] 0000000000000000 ffff880277c83d78 ffffffff81307a9b 0000000000000000
[ 557.094881] 0000000000000000 0000000000000000 000000000000009f ffff880277c83db8
[ 557.094882] ffffffff81054a85 0000000000000000 ffff880473326bc0 ffff880473326b98
[ 557.094882] Call Trace:
[ 557.094889] <IRQ> [<ffffffff81307a9b>] dump_stack+0x6b/0xa0
[ 557.094892] [<ffffffff81054a85>] warn_slowpath_common+0x95/0xe0
[ 557.094893] [<ffffffff81054aea>] warn_slowpath_null+0x1a/0x20
[ 557.094894] [<ffffffff810ab7b9>] rcu_sync_func+0xf9/0x150
[ 557.094897] [<ffffffff810b0620>] rcu_process_callbacks+0x290/0x740
[ 557.094900] [<ffffffff810590f3>] __do_softirq+0x113/0x330
[ 557.094902] [<ffffffff810593e5>] irq_exit+0x75/0x80
[ 557.094905] [<ffffffff8163ea16>] smp_apic_timer_interrupt+0x46/0x55
[ 557.094907] [<ffffffff8163d069>] apic_timer_interrupt+0x89/0x90
[ 557.094911] <EOI> [<ffffffff81530cbc>] ? cpuidle_enter_state+0x14c/0x2e0
[ 557.094912] [<ffffffff81530cb1>] ? cpuidle_enter_state+0x141/0x2e0
[ 557.094914] [<ffffffff81530e67>] cpuidle_enter+0x17/0x20
[ 557.094916] [<ffffffff81096d70>] cpu_startup_entry+0x310/0x370
[ 557.094919] [<ffffffff810c2aac>] ? clockevents_config_and_register+0x2c/0x40
[ 557.094923] [<ffffffff81035359>] start_secondary+0xf9/0x100
[ 557.094924] ---[ end trace fc087420ac1d8f17 ]---
[ 557.094925] XXX: ffff880473326b98 gp=2 cnt=-1 cb=1

This is:
if (WARN_ON(rsp->gp_count < 0)) xxx(rsp);

I'd say the warnings are pretty consistent between 4.4.14 and 4.4.20.
Another run on the same kernel (no reboots between tries) produced the following:

[ 700.981151] ------------[ cut here ]------------
[ 700.986643] WARNING: CPU: 3 PID: 2428 at kernel/rcu/sync.c:111 rcu_sync_enter+0x148/0x1a0()
[ 700.996541] Modules linked in: xt_state act_police cls_basic sch_ingress veth rbd libceph openvswitch nf_defrag_ipv6 nf_nat_ftp nf_conntrack_ftp xt_owner iptable_mangle xt_nat iptable_nat nf_nat_ipv4 nf_nat xt_CT iptable_raw nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack ip6table_filter ip6_tables rdma_ucm ib_ucm ib_uverbs rdma_cm iw_cm dm_mirror dm_region_hash dm_log ib_umad ib_ipoib ib_cm ib_sa ib_mad ib_core ib_addr ipv6 x86_pkg_temp_thermal kvm_intel kvm irqbypass crc32_pclmul ixgbe mdio ipmi_devintf ipmi_si ipmi_msghandler igb i2c_algo_bit sb_edac edac_core i2c_i801 lpc_ich mfd_core ioatdma dca shpchp dm_thin_pool dm_bio_prison dm_persistent_data dm_bufio
[ 701.071212] CPU: 3 PID: 2428 Comm: fsfreeze Tainted: G W 4.4.20-clouder1 #9
[ 701.080816] Hardware name: Supermicro X9DRW/X9DRW, BIOS 1.0b 10/11/2012
[ 701.088530] 0000000000000000 ffff88006b5efcf8 ffffffff81307a9b 000000000000097c
[ 701.097711] 0000000000000000 0000000000000000 000000000000006f ffff88006b5efd38
[ 701.106870] ffffffff81054a85 ffff880079ab9310 ffff880079ab9308 ffff880079ab9310
[ 701.116039] Call Trace:
[ 701.119084] [<ffffffff81307a9b>] dump_stack+0x6b/0xa0
[ 701.125149] [<ffffffff81054a85>] warn_slowpath_common+0x95/0xe0
[ 701.132184] [<ffffffff81054aea>] warn_slowpath_null+0x1a/0x20
[ 701.139027] [<ffffffff810ab9a8>] rcu_sync_enter+0x148/0x1a0
[ 701.145674] [<ffffffff8107b0b2>] ? __might_sleep+0x52/0x90
[ 701.152221] [<ffffffff8109c9be>] percpu_down_write+0x1e/0xf0
[ 701.158962] [<ffffffff8107b0b2>] ? __might_sleep+0x52/0x90
[ 701.165511] [<ffffffff811a868b>] freeze_super+0xab/0x1b0
[ 701.171867] [<ffffffff811b7c0d>] do_vfs_ioctl+0x29d/0x560
[ 701.178319] [<ffffffff811aae7e>] ? SYSC_newfstat+0x2e/0x40
[ 701.184959] [<ffffffff811b7f62>] SyS_ioctl+0x92/0xa0
[ 701.190926] [<ffffffff8163c357>] entry_SYSCALL_64_fastpath+0x12/0x6a
[ 701.198446] ---[ end trace fc087420ac1d8f18 ]---
[ 701.203915] XXX: ffff880079ab9308 gp=0 cnt=-1 cb=0

[ 701.209624] ------------[ cut here ]------------
[ 701.215193] WARNING: CPU: 3 PID: 2428 at kernel/rcu/sync.c:118 rcu_sync_enter+0x7e/0x1a0()
[ 701.224996] Modules linked in: xt_state act_police cls_basic sch_ingress veth rbd libceph openvswitch nf_defrag_ipv6 nf_nat_ftp nf_conntrack_ftp xt_owner iptable_mangle xt_nat iptable_nat nf_nat_ipv4 nf_nat xt_CT iptable_raw nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack ip6table_filter ip6_tables rdma_ucm ib_ucm ib_uverbs rdma_cm iw_cm dm_mirror dm_region_hash dm_log ib_umad ib_ipoib ib_cm ib_sa ib_mad ib_core ib_addr ipv6 x86_pkg_temp_thermal kvm_intel kvm irqbypass crc32_pclmul ixgbe mdio ipmi_devintf ipmi_si ipmi_msghandler igb i2c_algo_bit sb_edac edac_core i2c_i801 lpc_ich mfd_core ioatdma dca shpchp dm_thin_pool dm_bio_prison dm_persistent_data dm_bufio
[ 701.299721] CPU: 3 PID: 2428 Comm: fsfreeze Tainted: G W 4.4.20-clouder1 #9
[ 701.309325] Hardware name: Supermicro X9DRW/X9DRW, BIOS 1.0b 10/11/2012
[ 701.317038] 0000000000000000 ffff88006b5efcf8 ffffffff81307a9b 000000000000097c
[ 701.326203] 0000000000000000 0000000000000000 0000000000000076 ffff88006b5efd38
[ 701.335382] ffffffff81054a85 0000000000000007 ffff880079ab9308 ffff880079ab9310
[ 701.344556] Call Trace:
[ 701.347610] [<ffffffff81307a9b>] dump_stack+0x6b/0xa0
[ 701.353674] [<ffffffff81054a85>] warn_slowpath_common+0x95/0xe0
[ 701.360705] [<ffffffff81054aea>] warn_slowpath_null+0x1a/0x20
[ 701.367543] [<ffffffff810ab8de>] rcu_sync_enter+0x7e/0x1a0
[ 701.374090] [<ffffffff8107b0b2>] ? __might_sleep+0x52/0x90
[ 701.380637] [<ffffffff8109c9be>] percpu_down_write+0x1e/0xf0
[ 701.387379] [<ffffffff8107b0b2>] ? __might_sleep+0x52/0x90
[ 701.393928] [<ffffffff811a868b>] freeze_super+0xab/0x1b0
[ 701.400281] [<ffffffff811b7c0d>] do_vfs_ioctl+0x29d/0x560
[ 701.406737] [<ffffffff811aae7e>] ? SYSC_newfstat+0x2e/0x40
[ 701.413283] [<ffffffff811b7f62>] SyS_ioctl+0x92/0xa0
[ 701.419249] [<ffffffff8163c357>] entry_SYSCALL_64_fastpath+0x12/0x6a
[ 701.426773] ---[ end trace fc087420ac1d8f19 ]---
[ 701.432273] XXX: ffff880079ab9308 gp=1 cnt=0 cb=0

Regards,
Nikolay


>
> Oleg.
>