Re: stop_machine() soft lockup

From: Niklas Cassel
Date: Wed Sep 05 2018 - 09:46:00 EST


On Wed, Sep 05, 2018 at 03:14:10PM +0200, Peter Zijlstra wrote:
> On Wed, Sep 05, 2018 at 01:47:49PM +0200, Niklas Cassel wrote:
> > On Wed, Sep 05, 2018 at 10:42:41AM +0200, Peter Zijlstra wrote:
> > > On Tue, Sep 04, 2018 at 09:03:22PM +0200, Niklas Cassel wrote:
> > > > Hello Peter,
> > > >
> > > > I'm seeing some lockups when booting linux-next on a db820c arm64 board.
> > > > I've tried to analyze, but I'm currently stuck.
> > >
> > > Please see (should be in your Inbox too):
> > >
> > > https://lkml.kernel.org/r/20180905084158.GR24124@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
> >
> > I'm sorry if I mislead you by replying to your other mail thread,
> > both of them have timekeeping_notify() in the call trace,
> > but my problem has this call trace:
> >
> > [ 128.747853] wait_for_common+0xe0/0x1a0
> > [ 128.752023] wait_for_completionx+0x28/0x38
> > [ 128.755677] __stop_cpus+0xd4/0xf8
> > [ 128.759837] stop_cpus+0x70/0xa8
> > [ 128.762958] stop_machine_cpuslocked+0x124/0x130
> > [ 128.766345] stop_machine+0x54/0x70
> > [ 128.771373] timekeeping_notify+0x44/0x70
> > [ 128.774158] __clocksource_select+0xa8/0x1d8
> > [ 128.778605] clocksource_done_booting+0x4c/0x64
> > [ 128.782931] do_one_initcall+0x94/0x3f8
> > [ 128.786921] kernel_init_freeable+0x47c/0x528
> > [ 128.790742] kernel_init+0x18/0x110
> > [ 128.795673] ret_from_fork+0x10/0x1c
> >
> >
> > while your other mail thread has this call trace:
> >
> > * stop_machine()
> > * timekeeping_notify()
> > * __clocksource_select()
> > * clocksource_select()
> > * clocksource_watchdog_work()
> >
> >
> > So my problem is not related to the watchdog, I tried your revert anyway,
> > but unfortunately my problem persists.
>
> Oh, right, missed that distinction. And this is new?

I can reproduce it on 4.14.15 so probably not that new.

>
> I'll try and have a look. Lockdep doesn't suggest anything?

Thanks.

Usually, when seeing these soft lockups, no.

However, I just managed to get this lockdep splat when booting without
earlycon and ftrace=irqsoff
I'm not sure that it is related, but I'm planning on looking into it anyway:

[ 70.644248] Bluetooth: hci0: QCA Failed to request file: qca/rampatch_00440302.bin (-11)
[ 70.644299] Bluetooth: hci0: QCA Failed to download patch (-11)
[ 70.644475] cfg80211: failed to load regulatory.db
[ 70.658037] remoteproc remoteproc0: powering up adsp-pil
[ 70.663080] remoteproc remoteproc0: Direct firmware load for adsp.mdt failed with error -2
[ 70.667438]
[ 70.667445] ======================================================
[ 70.667450] WARNING: possible circular locking dependency detected
[ 70.667459] 4.19.0-rc2-next-20180904-00002-gd2e5ea917660-dirty #44 Tainted: G W
[ 70.667464] ------------------------------------------------------
[ 70.667471] kworker/0:1/14 is trying to acquire lock:
[ 70.667478] (____ptrval____) (&(&pool->lock)->rlock){-.-.}, at: __queue_work+0x3d4/0x6a0
[ 70.667512]
[ 70.667512] but task is already holding lock:
[ 70.667517] (____ptrval____) (&port_lock_key){-.-.}, at: msm_uart_irq+0x38/0x710
[ 70.667545]
[ 70.667545] which lock already depends on the new lock.
[ 70.667545]
[ 70.667550]
[ 70.667550] the existing dependency chain (in reverse order) is:
[ 70.667555]
[ 70.667555] -> #2 (&port_lock_key){-.-.}:
[ 70.667585] _raw_spin_lock+0x44/0x58
[ 70.667591] __msm_console_write+0x84/0x1e8
[ 70.667598] msm_console_write+0x64/0x78
[ 70.667608] console_unlock+0x400/0x610
[ 70.667615] register_console+0x290/0x3b8
[ 70.667622] uart_add_one_port+0x4cc/0x4d8
[ 70.667628] msm_serial_probe+0x158/0x1d8
[ 70.667639] platform_drv_probe+0x58/0xa8
[ 70.667645] really_probe+0x280/0x3d8
[ 70.667651] driver_probe_device+0x60/0x148
[ 70.667658] __driver_attach+0x144/0x148
[ 70.667665] bus_for_each_dev+0x84/0xd8
[ 70.667671] driver_attach+0x30/0x40
[ 70.667677] bus_add_driver+0x234/0x2a8
[ 70.667684] driver_register+0x64/0x110
[ 70.667691] __platform_driver_register+0x54/0x60
[ 70.667700] msm_serial_init+0x40/0x70
[ 70.667709] do_one_initcall+0x94/0x3f8
[ 70.667717] kernel_init_freeable+0x47c/0x528
[ 70.667726] kernel_init+0x18/0x110
[ 70.667732] ret_from_fork+0x10/0x1c
[ 70.667737]
[ 70.667737]
[ 70.667737] -> #1 (console_owner){-.-.}:
[ 70.667762] console_unlock+0x298/0x610
[ 70.667769] vprintk_emit+0x110/0x298
[ 70.667776] vprintk_default+0x48/0x58
[ 70.667782] vprintk_func+0x100/0x200
[ 70.667789] printk+0x74/0x94
[ 70.667797] __warn_printk+0x48/0xa0
[ 70.667803] check_flush_dependency+0xf8/0x150
[ 70.667809] __flush_work+0xc0/0x2e0
[ 70.667815] __cancel_work_timer+0x154/0x1c8
[ 70.667821] cancel_delayed_work_sync+0x24/0x30
[ 70.667832] ufshcd_ungate_work+0x30/0x100
[ 70.667838] process_one_work+0x2a0/0x710
[ 70.667844] worker_thread+0x48/0x478
[ 70.667852] kthread+0x134/0x138
[ 70.667858] ret_from_fork+0x10/0x1c
[ 70.667863]
[ 70.667863] -> #0 (&(&pool->lock)->rlock){-.-.}:
[ 70.667890] lock_acquire+0xc0/0x230
[ 70.667897] _raw_spin_lock+0x44/0x58
[ 70.667903] __queue_work+0x3d4/0x6a0
[ 70.667909] queue_work_on+0xc8/0xd0
[ 70.667920] hci_uart_tx_wakeup+0x188/0x228
[ 70.667926] hci_uart_write_wakeup+0x70/0x78
[ 70.667933] ttyport_write_wakeup+0xdc/0xe8
[ 70.667940] tty_port_tty_wakeup+0x28/0x38
[ 70.667946] uart_write_wakeup+0x24/0x38
[ 70.667952] msm_handle_tx_pio+0x94/0x1d0
[ 70.667958] msm_handle_tx+0x10c/0x3c0
[ 70.667964] msm_uart_irq+0x310/0x710
[ 70.667972] __handle_irq_event_percpu+0xb8/0x440
[ 70.667978] handle_irq_event_percpu+0x40/0x98
[ 70.667983] handle_irq_event+0x50/0x80
[ 70.667990] handle_fasteoi_irq+0xc8/0x1a0
[ 70.667998] generic_handle_irq+0x34/0x50
[ 70.668004] __handle_domain_irq+0x8c/0xf8
[ 70.668010] gic_handle_irq+0x84/0x180
[ 70.668016] el1_irq+0xec/0x198
[ 70.668022] console_unlock+0x4e0/0x610
[ 70.668029] vprintk_emit+0x110/0x298
[ 70.668036] dev_vprintk_emit+0x150/0x248
[ 70.668042] dev_printk_emit+0x84/0xa8
[ 70.668048] __dev_printk+0x5c/0xa0
[ 70.668054] _dev_warn+0x74/0x98
[ 70.668062] _request_firmware+0x428/0x5d8
[ 70.668068] request_firmware+0x40/0x50
[ 70.668076] rproc_boot+0xfc/0x480
[ 70.668081] rproc_auto_boot_callback+0x24/0x38
[ 70.668087] request_firmware_work_func+0x50/0x88
[ 70.668094] process_one_work+0x2a0/0x710
[ 70.668100] worker_thread+0x48/0x478
[ 70.668106] kthread+0x134/0x138
[ 70.668113] ret_from_fork+0x10/0x1c
[ 70.668118]
[ 70.668124] Chain exists of:
[ 70.668124] &(&pool->lock)->rlock --> console_owner --> &port_lock_key
[ 70.668124]
[ 70.668155] Possible unsafe locking scenario:
[ 70.668155]
[ 70.668160] CPU0 CPU1
[ 70.668164] ---- ----
[ 70.668169] lock(&port_lock_key);
[ 70.668182] lock(console_owner);
[ 70.668195] lock(&port_lock_key);
[ 70.668209] lock(&(&pool->lock)->rlock);
[ 70.668222]
[ 70.668222] *** DEADLOCK ***
[ 70.668222]
[ 70.668229] 6 locks held by kworker/0:1/14:
[ 70.668234] #0: (____ptrval____) ((wq_completion)"events"){+.+.}, at: process_one_work+0x1f8/0x710
[ 70.668263] #1: (____ptrval____) ((work_completion)(&fw_work->work)){+.+.}, at: process_one_work+0x1f8
/0x710
[ 70.668293] #2: (____ptrval____) (&rproc->lock){+.+.}, at: rproc_boot+0x4c/0x480
[ 70.668322] #3: (____ptrval____) (console_lock){+.+.}, at: vprintk_emit+0x108/0x298
[ 70.668351] #4: (____ptrval____) (&port_lock_key){-.-.}, at: msm_uart_irq+0x38/0x710
[ 70.668379] #5: (____ptrval____) (&hu->proto_lock){.+.+}, at: hci_uart_write_wakeup+0x70/0x78
[ 70.668409]
[ 70.668409] stack backtrace:
[ 70.668418] CPU: 0 PID: 14 Comm: kworker/0:1 Tainted: G W 4.19.0-rc2-next-20180904-00002
-gd2e5ea917660-dirty #44
[ 70.668423] Hardware name: Qualcomm Technologies, Inc. DB820c (DT)
[ 70.668435] Workqueue: events request_firmware_work_func
[ 70.668446] Call trace:
[ 70.668454] dump_backtrace+0x0/0x200
[ 70.668460] show_stack+0x24/0x30
[ 70.668469] dump_stack+0xac/0xe4
[ 70.668476] print_circular_bug.isra.19+0x1d4/0x2e8
[ 70.668482] __lock_acquire+0x1814/0x1878
[ 70.668489] lock_acquire+0xc0/0x230
[ 70.668495] _raw_spin_lock+0x44/0x58
[ 70.668501] __queue_work+0x3d4/0x6a0
[ 70.668507] queue_work_on+0xc8/0xd0
[ 70.668514] hci_uart_tx_wakeup+0x188/0x228
[ 70.668520] hci_uart_write_wakeup+0x70/0x78
[ 70.668527] ttyport_write_wakeup+0xdc/0xe8
[ 70.668533] tty_port_tty_wakeup+0x28/0x38
[ 70.668539] uart_write_wakeup+0x24/0x38
[ 70.668544] msm_handle_tx_pio+0x94/0x1d0
[ 70.668550] msm_handle_tx+0x10c/0x3c0
[ 70.668557] msm_uart_irq+0x310/0x710
[ 70.668563] __handle_irq_event_percpu+0xb8/0x440
[ 70.668568] handle_irq_event_percpu+0x40/0x98
[ 70.668574] handle_irq_event+0x50/0x80
[ 70.668581] handle_fasteoi_irq+0xc8/0x1a0
[ 70.668588] generic_handle_irq+0x34/0x50
[ 70.668594] __handle_domain_irq+0x8c/0xf8
[ 70.668599] gic_handle_irq+0x84/0x180
[ 70.668606] el1_irq+0xec/0x198
[ 70.668612] console_unlock+0x4e0/0x610
[ 70.668619] vprintk_emit+0x110/0x298
[ 70.668625] dev_vprintk_emit+0x150/0x248
[ 70.668631] dev_printk_emit+0x84/0xa8
[ 70.668637] __dev_printk+0x5c/0xa0
[ 70.668643] _dev_warn+0x74/0x98
[ 70.668649] _request_firmware+0x428/0x5d8
[ 70.668656] request_firmware+0x40/0x50
[ 70.668662] rproc_boot+0xfc/0x480
[ 70.668668] rproc_auto_boot_callback+0x24/0x38
[ 70.668674] request_firmware_work_func+0x50/0x88
[ 70.668680] process_one_work+0x2a0/0x710
[ 70.668686] worker_thread+0x48/0x478
[ 70.668693] kthread+0x134/0x138
[ 70.668699] ret_from_fork+0x10/0x1c


Kind regards,
Niklas