Re: sched/fair: Hard lockup from idle_balance()/task_numa_migrate() race

From: Ed Swierk
Date: Tue Nov 07 2017 - 12:26:17 EST


On Mon, Oct 16, 2017 at 4:11 PM, Ed Swierk <eswierk@xxxxxxxxxxxxxxxxxx> wrote:
> To recap: a dual-socket Xeon (E5 v4) server system had been running a
> bunch of KVM workloads just fine for over 6 weeks. Suddenly hard
> lockups occurred on cpu 13 in task_numa_migrate(), and cpu 0 in
> idle_balance(). That condition persisted until the system was rebooted
> after a couple of hours, and meanwhile all manner of soft lockups and
> even SATA errors appeared.
>
> [3851435.777762] NMI watchdog: Watchdog detected hard LOCKUP on cpu 13
> [3851435.784582] Modules linked in: vhost_net tun vhost macvtap 8021q garp mrp drbg ansi_cprng dm_crypt algif_skcipher af_alg ip_gre ip_tunnel gre macvlan ebtable_filter ebtables openvswitch nf_defrag_ipv6 libcrc32c nfsd auth_rpcgss nfs_acl nfs lockd grace fscache sunrpc ip6table_filter ip6_tables nf_log_ipv4 nf_log_common xt_LOG xt_limit xt_tcpudp nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack iptable_filter ip_tables x_tables iTCO_wdt iTCO_vendor_support x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm irqbypass crct10dif_pclmul raid10 crc32_pclmul aesni_intel aes_x86_64 lrw gf128mul glue_helper ablk_helper cryptd sb_edac lpc_ich mfd_core mei_me mei i2c_i801 ioatdma ipmi_ssif ipmi_msghandler acpi_pad acpi_cpufreq squashfs lz4_decompress xhci_pci xhci_hcd ixgbe mdio vxlan ip6_udp_tunnel udp_tunnel ptp pps_core dca ehci_pci ehci_hcd usbcore ahci libahci usb_common tpm_tis
> [3851435.874884] irq event stamp: 3828060294
> [3851435.879398] hardirqs last enabled at (3828060293): [<ffffffff810db86c>] ffffffff810db86c
> [3851435.888810] hardirqs last disabled at (3828060294): [<ffffffff8170ff74>] ffffffff8170ff74
> [3851435.898254] softirqs last enabled at (3828060156): [<ffffffff8107f31e>] ffffffff8107f31e
> [3851435.907666] softirqs last disabled at (3828060139): [<ffffffff8107f648>] ffffffff8107f648
> [3851435.917094] CPU: 13 PID: 2146 Comm: pocviexg Not tainted 4.4.52-grsec #1
> [3851435.925252] Hardware name: Intel S2600WTTR, BIOS SE5C610.86B.01.01.0016.C1.033120161139 03/31/2016
> [3851435.937473] task: ffff881c9ff35c00 ti: ffff881c9ff372b0 task.ti: ffff881c9ff372b0
> [3851435.946104] RIP: 0010:[<ffffffff810ad854>] [<ffffffff810ad854>] try_to_wake_up+0xe4/0x500
> [3851435.955614] RSP: 0000:ffffc90016beba40 EFLAGS: 00000002
> [3851435.961780] RAX: 0000000000000000 RBX: ffff88102cc8e3d8 RCX: 0000000000000001
> [3851435.970016] RDX: 0000000000000001 RSI: 00000000000001e4 RDI: 0000000000000000
> [3851435.978259] RBP: ffffc90016beba80 R08: 0000000000000000 R09: 0000000000000001
> [3851435.986493] R10: 00000000000000e0 R11: ffff881c9ff35c00 R12: 0000000000000082
> [3851435.994727] R13: ffff88102cc8dc00 R14: 00000000001d3f00 R15: ffffffff81f25060
> [3851436.002962] FS: 00006e79227fc700(0000) GS:ffff882034000000(0000) knlGS:0000000000000000
> [3851436.012266] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [3851436.018920] CR2: 00006e79227fba38 CR3: 000000201d2fe000 CR4: 00000000003626f0
> [3851436.027155] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [3851436.035391] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> [3851436.043626] Stack:
> [3851436.046098] ffff8820341ce620 ffff8820341ce608 0000000000000000 ffff8820341ce600
> [3851436.054695] ffffc90016bebb48 000000000000000d 0000000000000000 ffff881033fce600
> [3851436.063259] ffffc90016beba90 ffffffff810adc80 ffffc90016bebbe8 ffffffff8112c830
> [3851436.071868] Call Trace:
> [3851436.074862] [<ffffffff810adc80>] wake_up_process+0x10/0x20 https://github.com/skyportsystems/linux/blob/4.4.52-grsec/kernel/sched/core.c#L1997
> [3851436.081317] [<ffffffff8112c830>] stop_two_cpus+0x1b0/0x280 https://github.com/skyportsystems/linux/blob/4.4.52-grsec/kernel/stop_machine.c#L235
> [3851436.087778] [<ffffffff8112c490>] ? cpu_stop_park+0x40/0x40
> [3851436.094237] [<ffffffff8112c490>] ? cpu_stop_park+0x40/0x40
> [3851436.100715] [<ffffffff810ad560>] ? __migrate_swap_task.part.94+0x70/0x70
> [3851436.108539] [<ffffffff810ade01>] migrate_swap+0xf1/0x1f0 https://github.com/skyportsystems/linux/blob/4.4.52-grsec/kernel/sched/core.c#L1408
> [3851436.114800] [<ffffffff810b9d55>] task_numa_migrate+0x345/0xb60 https://github.com/skyportsystems/linux/blob/4.4.52-grsec/kernel/sched/fair.c#L1572
> [3851436.121656] [<ffffffff811092e3>] ? futex_wait_queue_me+0xf3/0x160
> [3851436.128801] [<ffffffff810ba5e3>] numa_migrate_preferred+0x73/0x80 https://github.com/skyportsystems/linux/blob/4.4.52-grsec/kernel/sched/fair.c#L1597
> [3851436.135936] [<ffffffff810bb944>] task_numa_fault+0x7a4/0xee0 https://github.com/skyportsystems/linux/blob/4.4.52-grsec/kernel/sched/fair.c#L2137
> [3851436.142604] [<ffffffff810bafe0>] ? should_numa_migrate_memory+0x50/0x130
> [3851436.150437] [<ffffffff811b2db6>] handle_mm_fault+0xfd6/0x1e10 https://github.com/skyportsystems/linux/blob/4.4.52-grsec/mm/memory.c#L3503
> [3851436.157189] [<ffffffff811b1e28>] ? handle_mm_fault+0x48/0x1e10
> [3851436.164051] [<ffffffff810a85aa>] ? ___might_sleep+0x1ea/0x220
> [3851436.170804] [<ffffffff8106405f>] __do_page_fault+0x1bf/0x460
> [3851436.177475] [<ffffffff8110ca44>] ? sys_futex+0x74/0x230
> [3851436.183642] [<ffffffff8106432c>] do_page_fault+0x2c/0x40
> [3851436.189909] [<ffffffff817128b8>] page_fault+0x28/0x30
> [3851436.195892] Code: 0f 85 e0 02 00 00 41 8b 85 b8 16 00 00 89 45 d0 41 8b 7d 4c 49 c7 c6 00 3f 1d 00 85 ff 0f 85 68 01 00 00 41 8b 4d 30 85 c9 74 0a <f3> 90 41 8b 55 30 85 d2 75 f6 49 8b 45 00 a8 02 b8 00 00 00 00
>
> [3851438.906435] NMI watchdog: Watchdog detected hard LOCKUP on cpu 0
> [3851438.913158] Modules linked in: vhost_net tun vhost macvtap 8021q garp mrp drbg ansi_cprng dm_crypt algif_skcipher af_alg ip_gre ip_tunnel gre macvlan ebtable_filter ebtables openvswitch nf_defrag_ipv6 libcrc32c nfsd auth_rpcgss nfs_acl nfs lockd grace fscache sunrpc ip6table_filter ip6_tables nf_log_ipv4 nf_log_common xt_LOG xt_limit xt_tcpudp nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack iptable_filter ip_tables x_tables iTCO_wdt iTCO_vendor_support x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm irqbypass crct10dif_pclmul raid10 crc32_pclmul aesni_intel aes_x86_64 lrw gf128mul glue_helper ablk_helper cryptd sb_edac lpc_ich mfd_core mei_me mei i2c_i801 ioatdma ipmi_ssif ipmi_msghandler acpi_pad acpi_cpufreq squashfs lz4_decompress xhci_pci xhci_hcd ixgbe mdio vxlan ip6_udp_tunnel udp_tunnel ptp pps_core dca ehci_pci ehci_hcd usbcore ahci libahci usb_common tpm_tis
> [3851439.003452] irq event stamp: 25935634
> [3851439.007769] hardirqs last enabled at (25935633): [<ffffffff81710341>] ffffffff81710341
> [3851439.017010] hardirqs last disabled at (25935634): [<ffffffff8170ff74>] ffffffff8170ff74
> [3851439.026226] softirqs last enabled at (25935520): [<ffffffff8107f31e>] ffffffff8107f31e
> [3851439.035441] softirqs last disabled at (25935445): [<ffffffff8107f648>] ffffffff8107f648
> [3851439.044657] CPU: 0 PID: 10 Comm: migration/0 Not tainted 4.4.52-grsec #1
> [3851439.052892] Hardware name: Intel S2600WTTR, BIOS SE5C610.86B.01.01.0016.C1.033120161139 03/31/2016
> [3851439.065111] task: ffff88102cc8dc00 ti: ffff88102cc8f2b0 task.ti: ffff88102cc8f2b0
> [3851439.073747] RIP: 0010:[<ffffffff813c4935>] [<ffffffff813c4935>] delay_tsc+0x25/0x70
> [3851439.082679] RSP: 0000:ffffc900006ebb40 EFLAGS: 00000002
> [3851439.088852] RAX: 000000002591f96b RBX: ffff8820341ce608 RCX: 0000000000000018
> [3851439.101462] RDX: 001cab2e00000000 RSI: 001cab2e2591f96b RDI: 0000000000000001
> [3851439.109695] RBP: ffffc900006ebb40 R08: 0000000000000000 R09: 0000000000000000
> [3851439.117950] R10: 0000000000000038 R11: 0000000000000000 R12: 0000000021d37081
> [3851439.126198] R13: 0000000000000001 R14: 000000007d234b28 R15: 0000000000000000
> [3851439.134440] FS: 0000000000000000(0000) GS:ffff881033e00000(0000) knlGS:0000000000000000
> [3851439.143741] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [3851439.150393] CR2: 000000c420148008 CR3: 0000000001f36000 CR4: 00000000003626f0
> [3851439.158627] Stack:
> [3851439.161096] ffffc900006ebb50 ffffffff813c487a ffffc900006ebb80 ffffffff810dba87
> [3851439.169669] ffff8820341ce608 ffff8820341ce620 0000000000000086 00000000000006ae
> [3851439.178228] ffffc900006ebbb0 ffffffff817101a2 ffffffff8112c1cb ffff8820341ce600
> [3851439.186825] Call Trace:
> [3851439.189784] [<ffffffff813c487a>] __delay+0xa/0x10
> [3851439.195385] [<ffffffff810dba87>] do_raw_spin_lock+0x87/0x160
> [3851439.202062] [<ffffffff817101a2>] _raw_spin_lock_irqsave+0xa2/0xb0
> [3851439.209238] [<ffffffff8112c1cb>] ? cpu_stop_queue_work+0x2b/0x70
> [3851439.216287] [<ffffffff8112c1cb>] cpu_stop_queue_work+0x2b/0x70 https://github.com/skyportsystems/linux/blob/4.4.52-grsec/kernel/stop_machine.c#L89
> [3851439.223149] [<ffffffff8112c92b>] stop_one_cpu_nowait+0x2b/0x30 https://github.com/skyportsystems/linux/blob/4.4.52-grsec/kernel/stop_machine.c#L310
> [3851439.229999] [<ffffffff810c1127>] load_balance+0x827/0x960 https://github.com/skyportsystems/linux/blob/4.4.52-grsec/kernel/sched/fair.c#L7184
> [3851439.236371] [<ffffffff810eae3d>] ? debug_lockdep_rcu_enabled+0x1d/0x20
> [3851439.243995] [<ffffffff810c1608>] pick_next_task_fair+0x3a8/0x750 https://github.com/skyportsystems/linux/blob/4.4.52-grsec/kernel/sched/fair.c#L5348
> [3851439.251062] [<ffffffff810c14db>] ? pick_next_task_fair+0x27b/0x750
> [3851439.258289] [<ffffffff81708fdd>] __schedule+0x1ad/0xbd0 https://github.com/skyportsystems/linux/blob/4.4.52-grsec/kernel/sched/core.c#L3196
> [3851439.264456] [<ffffffff810d461d>] ? trace_hardirqs_on_caller+0x13d/0x1d0
> [3851439.272204] [<ffffffff81709a37>] schedule+0x37/0x80 https://github.com/skyportsystems/linux/blob/4.4.52-grsec/kernel/sched/core.c#L3236
> [3851439.277984] [<ffffffff810a51cb>] smpboot_thread_fn+0x16b/0x1e0 https://github.com/skyportsystems/linux/blob/4.4.52-grsec/kernel/smpboot.c#L160
> [3851439.284822] [<ffffffff810a5060>] ? sort_range+0x20/0x20
> [3851439.291001] [<ffffffff810a124c>] kthread+0xfc/0x120
> [3851439.296796] [<ffffffff8170adf1>] ? wait_for_completion+0xe1/0x110
> [3851439.303936] [<ffffffff810a1150>] ? kthread_create_on_node+0x240/0x240
> [3851439.311464] [<ffffffff81710aae>] ret_from_fork+0x3e/0x70
> [3851439.317738] [<ffffffff810a1150>] ? kthread_create_on_node+0x240/0x240
> [3851439.325281] Code: 00 5d c3 0f 1f 00 55 65 ff 05 b8 65 c4 7e 48 89 e5 65 44 8b 05 35 48 c4 7e 0f ae e8 0f 31 48 c1 e2 20 48 89 d6 48 09 c6 0f ae e8 <0f> 31 48 c1 e2 20 48 09 d0 48 89 c1 48 29 f1 48 39 cf 76 27 65
>
> [3851450] NMI watchdog: BUG: soft lockup - CPU#8 stuck for 22s! [pocviexg:2136]
> [3851450] NMI watchdog: BUG: soft lockup - CPU#12 stuck for 22s! [CPU 0/KVM:9662]
> [3851450] NMI watchdog: BUG: soft lockup - CPU#14 stuck for 22s! [CPU 1/KVM:10302]
>
> [3851478] NMI watchdog: BUG: soft lockup - CPU#8 stuck for 22s! [pocviexg:2136]
> [3851478] NMI watchdog: BUG: soft lockup - CPU#12 stuck for 22s! [CPU 0/KVM:9662]
> [3851478] NMI watchdog: BUG: soft lockup - CPU#14 stuck for 22s! [CPU 1/KVM:10302]
>
> [3851482] BUG: spinlock lockup suspected on CPU#0, migration/0/10
>
> [3851482] Sending NMI to all CPUs:
> [3851482] NMI backtrace for cpu 0
> [3851483] NMI backtrace for cpu 1
> [3851483] NMI backtrace for cpu 2
> [3851483] NMI backtrace for cpu 3
> [3851483] NMI backtrace for cpu 4
>
> [3851484] ata8.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen **
> [3851484] ata7.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen **
>
> [3851484] NMI backtrace for cpu 5
> [3851484] NMI backtrace for cpu 6
> ...
> [3851489] NMI backtrace for cpu 24
>
> [3851489] ata8.00: qc timeout (cmd 0xec) **
> [3851489] ata7.00: qc timeout (cmd 0xec) **
>
> [3851489] NMI backtrace for cpu 25
> ...
> [3851490] NMI backtrace for cpu 31
>
> [3851499] ata8.00: qc timeout (cmd 0xec) **
> [3851499] ata7.00: qc timeout (cmd 0xec) **
>
> [3851510] NMI watchdog: BUG: soft lockup - CPU#12 stuck for 22s! [CPU 0/KVM:9662]
> ...and so on for 2 more hours, until the system was finally rebooted
>
> Based on the addresses in the stack and registers, here's what I think
> happened.
>
> On cpu 13:
>
> - task_numa_fault() calls task_numa_migrate(), which selects the task
> on cpu 0 as the dst_task.
> - migrate_swap() calls stop_two_cpus(), which acquires the cpu_stopper
> locks for the dst_cpu (cpu 0, at 0xffff881033fce600) and src_cpu
> (cpu X, at 0xffff8820341ce600).
> - stop_two_cpus() calls wake_up_process() on the lower-numbered cpu
> first, which has to be cpu 0.
> - wake_up_process() spins until the cpu 0 task (at 0xffff88102cc8dc00)
> is no longer on_cpu.
>
> On cpu 0:
>
> - pick_next_task_fair() calls idle_balance(). According to the "This
> is OK" comment, current is on_cpu at this point.
> - idle_balance() calls load_balance() for dst_cpu 0.
> - load_balance() decides to move a task from cpu X, so calls
> stop_one_cpu_nowait() on cpu X.
> - stop_one_cpu_nowait() spins trying to acquire the cpu_stopper lock
> for cpu X (at 0xffff8820341ce600).
>
> So idle_balance() on cpu 0 is stuck waiting for task_numa_fault() to
> move a task to cpu 0, which is blocked on idle_balance() completing.
>
> Meanwhile, the SATA controller interrupt handler, which is placed on
> cpu 0, doesn't get a chance to run, causing the command failure
> messages from ata7 and ata8.
>
> Also, it appears that task_numa_fault() tries to migrate current, so
> the src_cpu X used by task_numa_migrate() is cpu 13 in this
> case. The key issue, though, is that both task_numa_migrate() and
> idle_balance() are trying to stop the same cpu, regardless of whether
> it's the cpu task_numa_migrate() is running on.

Here's another view of the lockup:

cpu 13                                  cpu 0

task_numa_fault()
  numa_migrate_preferred()
    task_numa_migrate()
      migrate_swap(current,
                   dst_task on cpu 0)
        stop_two_cpus(13, 0)
                                        smpboot_thread_fn()
                                    A:    cpu_stop_should_run(0) ret 0
                                          schedule()
                                            pick_next_task_fair()
                                              idle_balance()
                                                load_balance(0)
                                                  stop_one_cpu_nowait(13)
          cpu_stop_queue_two_works(0, 13)
 B:         spin_lock(stopper for cpu 0)
 C:         spin_lock(stopper for cpu 13)
            add multi_cpu_stop() to stopper wq
              for cpu 0
            wake_up_process(cpu_stopper_thread()
              on cpu 0)
                                                    cpu_stop_queue_work(13)
                                    D:                spin_lock(stopper for cpu 13)

[deadlock]

Key conditions:

1. cpu 0 decided to pull a task from cpu 13 via load balancing
2. cpu 13 decided to swap its current task with the task on cpu 0 for
   numa balancing
3. the two paths interleaved such that A < B and C < D

(If B < A, cpu_stop_should_run(0) on cpu 0 would not have returned 0
until the numa migration on cpu 13 had completed: the cpu 0 task would
have set its state to TASK_RUNNING and called multi_cpu_stop(),
instead of calling schedule() and triggering a load balance. If D < C,
the load balance on cpu 0 would have finished moving the cpu 13 task
before the numa migration on cpu 13 got started, and
migrate_swap_stop() would have noticed the move and bailed out.)

Reproducing the deadlock on demand seems pretty hopeless. At least I
might be able to tip the timing condition (3) in my favor by adding
artificial delays just after A and just before B. But I have no idea
how to induce conditions (1) and (2).

I don't see any differences between 4.4.52-grsec and current upstream
that would prevent the same deadlock: there's still no explicit mutual
exclusion between task_numa_migrate() and load_balance() working on
the same cpus; nor does either one have a way of noticing the other is
in progress and bailing out.

At least stop_one_cpu_nowait() now returns false if the cpu stopper is
disabled (though load_balance() ignores it). We could create a similar
try_stop_one_cpu_nowait() that returns false if the cpu stopper lock
is already owned by another cpu, instead of spinning. Then
load_balance() could bail out in case the same cpus are already being
manipulated by a numa migration. But I have no idea if that's a
desirable way to prevent this deadlock.

Any advice on reproducing this issue in a current kernel, or on fixing
it, would be appreciated.

--Ed