Re: kernel deadlock

From: John Stultz
Date: Mon Sep 09 2013 - 16:30:01 EST


On 09/04/2013 01:11 AM, Gerlando Falauto wrote:
> Hi John,
>
> On 09/03/2013 07:26 PM, John Stultz wrote:
>> On 09/03/2013 07:57 AM, Gerlando Falauto wrote:
>>> Hi,
>>>
>>> I tried again from scratch, so let me recap the whole situation, so we
>>> can all view it from the same standpoint. This should make the problem
>>> easier to see and reproduce.
>>>
>>> I can confirm that running a stock 3.10 kernel with HRTICK enabled:
>> [snip]
>>> makes the following program (and the whole board, as a matter of fact)
>>> hang with no further notice:
>>>
>> [snip]
>>> If I then revert everything up to (and including) the offending commit
>>> (mind the '~'):
>>>
>>> $ git log --oneline ...06c017f~ -- kernel/time/timekeeping.c
>>> kernel/time/ntp.c | cut -f1 -d' ' | xargs git revert
>>>
>>> The problem disappears.
>>>
>>> If I then cherry-pick again the offending commit:
>>>
>>> $ git cherry-pick 06c017f; git log -1
>>>
>>> commit 06c017fdd4dc48451a29ac37fc1db4a3f86b7f40
>>> Author: John Stultz <john.stultz@xxxxxxxxxx>
>>> Date: Fri Mar 22 11:37:28 2013 -0700
>>>
>>> timekeeping: Hold timekeepering locks in do_adjtimex and hardpps
>>>
>> [snip]
>>> And as soon as I also cherry-pick (notice there is another commit in
>>> between, which seems not to be relevant on this matter):
>>>
>>> $ git cherry-pick a076b2146fabb0894cae5e0189a8ba3f1502d737; git show
>>>
>>> commit a076b2146fabb0894cae5e0189a8ba3f1502d737
>>> Author: John Stultz <john.stultz@xxxxxxxxxx>
>>> Date: Fri Mar 22 11:52:03 2013 -0700
>>>
>>> ntp: Remove ntp_lock, using the timekeeping locks to protect ntp
>>> state
>>>
>> [snip]
>>>
>>> I end up in the situation where the system hangs completely and NO
>>> deadlock report whatsoever is output.
>>>
>>> So it looks like 06c017fdd4dc48451a29ac37fc1db4a3f86b7f40 introduces
>>> the deadlock, while a076b2146fabb0894cae5e0189a8ba3f1502d737 cares to
>>> hide the report.
>>>
>>> Notice how I tested the above on an ARM board; on PowerPC I get
>>> similar results, although I am not able to see the deadlock report
>>> under any circumstances (enabling CONFIG_PROVE_LOCKING, which is the
>>> flag that triggers the deadlock report, causes the kernel to hang at
>>> startup even on a vanilla 3.10 kernel).
>>>
>>> John, could you please confirm whether you're at least able to
>>> reproduce it somehow?
>>
>> Thanks again for the detailed notes. I've so far been unable to
>> reproduce this, but I'm still looking at it.

So I think I've managed to finally reproduce this and hunt it down.

With Peter's "sched: Fix HRTICK" patch and HRTICK enabled, I found I
could trigger a hard hang at boot on my x86_64 kvm system. sysrq didn't
function, so I checked out info cpus and that pointed to both cpus being
in ktime_get() and ktime_get_update_offsets(), which suggested a
seqcount deadlock (basically calling something that reads the seqlock
while we hold the write on it).

Unfortunately the seqlock/seqcount infrastructure doesn't support
lockdep, so I added some debug code to take and release the
timekeeper_lock in every function that does a read on the timekeeper_seq.

This popped up with:
[ 15.846386] ======================================================
[ 15.847174] [ INFO: possible circular locking dependency detected ]
[ 15.847959] 3.10.0+ #30 Not tainted
[ 15.848402] -------------------------------------------------------
[ 15.849182] ntpd/4062 is trying to acquire lock:
[ 15.849765] (&(&pool->lock)->rlock){..-...}, at:
[<ffffffff810aa9b5>] __queue_work+0x145/0x480
[ 15.850051]
[ 15.850051] but task is already holding lock:
[ 15.850051] (timekeeper_lock){-.-.-.}, at: [<ffffffff810df6df>]
do_adjtimex+0x7f/0x100
[ 15.850051]
[ 15.850051] which lock already depends on the new lock.
[ 15.850051]
[ 15.850051]
[ 15.850051] the existing dependency chain (in reverse order) is:
[ 15.850051]
-> #2 (timekeeper_lock){-.-.-.}:
[ 15.850051] [<ffffffff810eee7e>] lock_acquire+0x9e/0x1e0
[ 15.850051] [<ffffffff81900983>] _raw_spin_lock_irqsave+0x53/0x90
[ 15.850051] [<ffffffff810dde62>] ktime_get+0x32/0x110
[ 15.850051] [<ffffffff810c0cd0>] hrtick_start+0x20/0x90
[ 15.850051] [<ffffffff810c8fb3>] hrtick_start_fair+0x83/0xd0
[ 15.850051] [<ffffffff810c94ce>] hrtick_update+0x5e/0x60
[ 15.850051] [<ffffffff810cc5a1>] enqueue_task_fair+0x961/0xfd0
[ 15.850051] [<ffffffff810c0c11>] enqueue_task+0x61/0x70
[ 15.850051] [<ffffffff810c101e>] activate_task+0x1e/0x20
[ 15.850051] [<ffffffff818fedf9>] __schedule+0x799/0xa10
[ 15.850051] [<ffffffff818ff094>] schedule+0x24/0x70
[ 15.850051] [<ffffffff818fc84c>] schedule_timeout+0x16c/0x2c0
[ 15.850051] [<ffffffff818fe60d>] io_schedule_timeout+0x9d/0xf0
[ 15.850051] [<ffffffff818ff74f>] wait_for_completion_io+0xaf/0x120
[ 15.850051] [<ffffffff813468ec>] blk_execute_rq+0x8c/0xe0
[ 15.850051] [<ffffffff8157b23f>] scsi_execute+0xdf/0x170
[ 15.850051] [<ffffffff8157b365>] scsi_execute_req_flags+0x95/0x110
[ 15.850051] [<ffffffff8158c3b4>] sr_check_events+0xa4/0x2e0
[ 15.850051] [<ffffffff815e0393>] cdrom_check_events+0x13/0x30
[ 15.850051] [<ffffffff8158bde4>] sr_block_check_events+0x34/0x50
[ 15.850051] [<ffffffff81349dd7>] disk_check_events+0x57/0x180
[ 15.850051] [<ffffffff81349f17>] disk_events_workfn+0x17/0x20
[ 15.850051] [<ffffffff810ab777>] process_one_work+0x1e7/0x660
[ 15.850051] [<ffffffff810ac069>] worker_thread+0x119/0x370
[ 15.850051] [<ffffffff810b4376>] kthread+0xd6/0xe0
[ 15.850051] [<ffffffff81901fac>] ret_from_fork+0x7c/0xb0
[ 15.850051]
-> #1 (&p->pi_lock){-.-.-.}:
[ 15.850051] [<ffffffff810eee7e>] lock_acquire+0x9e/0x1e0
[ 15.850051] [<ffffffff81900983>] _raw_spin_lock_irqsave+0x53/0x90
[ 15.850051] [<ffffffff810c436c>] try_to_wake_up+0x2c/0x300
[ 15.850051] [<ffffffff810c467e>] wake_up_process+0x1e/0x40
[ 15.850051] [<ffffffff810a89e2>] start_worker+0x22/0x30
[ 15.850051] [<ffffffff810a9d98>] create_and_start_worker+0x48/0x90
[ 15.850051] [<ffffffff820d89a8>] init_workqueues+0x1b8/0x380
[ 15.850051] [<ffffffff820c0e94>] do_one_initcall+0x7b/0x114
[ 15.850051] [<ffffffff820c0fc6>] kernel_init_freeable+0x99/0x1c3
[ 15.850051] [<ffffffff818e2b59>] kernel_init+0x9/0xf0
[ 15.850051] [<ffffffff81901fac>] ret_from_fork+0x7c/0xb0
[ 15.850051]
-> #0 (&(&pool->lock)->rlock){..-...}:
[ 15.850051] [<ffffffff810ed20e>] __lock_acquire+0x1dce/0x1ee0
[ 15.850051] [<ffffffff810eee7e>] lock_acquire+0x9e/0x1e0
[ 15.850051] [<ffffffff819008f1>] _raw_spin_lock+0x41/0x80
[ 15.850051] [<ffffffff810aa9b5>] __queue_work+0x145/0x480
[ 15.850051] [<ffffffff810aadba>] __queue_delayed_work+0xaa/0x1a0
[ 15.850051] [<ffffffff810ac4de>] queue_delayed_work_on+0x8e/0xa0
[ 15.850051] [<ffffffff810dfe68>] __do_adjtimex+0x158/0x560
[ 15.850051] [<ffffffff810df703>] do_adjtimex+0xa3/0x100
[ 15.850051] [<ffffffff81090f7b>] SYSC_adjtimex+0x3b/0x80
[ 15.850051] [<ffffffff810912b9>] SyS_adjtimex+0x9/0x10
[ 15.850051] [<ffffffff81902264>] tracesys+0xdd/0xe2
[ 15.850051]
[ 15.850051] other info that might help us debug this:
[ 15.850051]
[ 15.850051] Chain exists of:
&(&pool->lock)->rlock --> &p->pi_lock --> timekeeper_lock
[ 15.850051] Possible unsafe locking scenario:
[ 15.850051]
[ 15.850051] CPU0 CPU1
[ 15.850051] ---- ----
[ 15.850051] lock(timekeeper_lock);
[ 15.850051] lock(&p->pi_lock);
[ 15.850051] lock(timekeeper_lock);
[ 15.850051] lock(&(&pool->lock)->rlock);
[ 15.850051]
[ 15.850051] *** DEADLOCK ***
[ 15.850051]
[ 15.850051] 1 lock held by ntpd/4062:
[ 15.850051] #0: (timekeeper_lock){-.-.-.}, at: [<ffffffff810df6df>]
do_adjtimex+0x7f/0x100
[ 15.850051]
[ 15.850051] stack backtrace:
[ 15.850051] CPU: 0 PID: 4062 Comm: ntpd Not tainted 3.10.0+ #30
[ 15.850051] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
[ 15.850051] ffffffff82354260 ffff880000239b78 ffffffff818f6cfe
ffff880000239bb8
[ 15.850051] ffffffff818f2f51 ffff880000239c40 ffff88000002ea78
ffff88000002ea40
[ 15.850051] ffff88000002e310 0000000000000000 0000000000000001
ffff880000239ca8
[ 15.850051] Call Trace:
[ 15.850051] [<ffffffff818f6cfe>] dump_stack+0x19/0x1b
[ 15.850051] [<ffffffff818f2f51>] print_circular_bug+0x1f9/0x208
[ 15.850051] [<ffffffff810ed20e>] __lock_acquire+0x1dce/0x1ee0
[ 15.850051] [<ffffffff810e9c2d>] ? trace_hardirqs_off+0xd/0x10
[ 15.850051] [<ffffffff810c6fad>] ? sched_clock_local+0x1d/0x90
[ 15.850051] [<ffffffff810eee7e>] lock_acquire+0x9e/0x1e0
[ 15.850051] [<ffffffff810aa9b5>] ? __queue_work+0x145/0x480
[ 15.850051] [<ffffffff819008f1>] _raw_spin_lock+0x41/0x80
[ 15.850051] [<ffffffff810aa9b5>] ? __queue_work+0x145/0x480
[ 15.850051] [<ffffffff810aa9b5>] __queue_work+0x145/0x480
[ 15.850051] [<ffffffff810aadba>] __queue_delayed_work+0xaa/0x1a0
[ 15.850051] [<ffffffff810ac4de>] queue_delayed_work_on+0x8e/0xa0
[ 15.850051] [<ffffffff810dfe68>] __do_adjtimex+0x158/0x560
[ 15.850051] [<ffffffff810df703>] do_adjtimex+0xa3/0x100
[ 15.850051] [<ffffffff81090f7b>] SYSC_adjtimex+0x3b/0x80
[ 15.850051] [<ffffffff810912b9>] SyS_adjtimex+0x9/0x10
[ 15.850051] [<ffffffff81902264>] tracesys+0xdd/0xe2



So yea, thanks to the lockdep code we can see there's a three-lock deadlock.

And as you discovered, it connected to adjtimex and the timekeeping
locking changes. Specifically that we're now calling
schedule_delayed_work() under a timekeeper_seq write, and that somehow
ends up calling back to a timekeeping accessor function.

Now, I'm still in the dark as to why HRTICK exposes this, but seems like
the following patch should resolve the issue (and quiets the lockdep
warnings in my testing). Let me know how it works for you!

Ingo: This makes me think we really should have some lockdep smarts
added to seqlock/seqcount structures. Is there something already
discovered thats preventing this, or has this just not yet been tried?

thanks
-john