Re: possible deadlock in __wake_up_common_lock

From: Peter Zijlstra
Date: Tue Jan 08 2019 - 08:09:19 EST


On Wed, Jan 02, 2019 at 01:51:01PM +0100, Vlastimil Babka wrote:

> > syz-executor0/8529 is trying to acquire lock:
> > 000000005e7fb829 (&pgdat->kswapd_wait){....}, at:
> > __wake_up_common_lock+0x19e/0x330 kernel/sched/wait.c:120
>
> From the backtrace at the end of report I see it's coming from
>
> > wakeup_kswapd+0x5f0/0x930 mm/vmscan.c:3982
> > steal_suitable_fallback+0x538/0x830 mm/page_alloc.c:2217
>
> This wakeup_kswapd is new due to Mel's 1c30844d2dfe ("mm: reclaim small
> amounts of memory when an external fragmentation event occurs") so CC Mel.

Right; and I see Mel already has a fix for that.

> > the existing dependency chain (in reverse order) is:
> >
> > -> #4 (&(&zone->lock)->rlock){-.-.}:
> > __raw_spin_lock_irqsave include/linux/spinlock_api_smp.h:110 [inline]
> > _raw_spin_lock_irqsave+0x99/0xd0 kernel/locking/spinlock.c:152
> > rmqueue mm/page_alloc.c:3082 [inline]
> > get_page_from_freelist+0x9eb/0x52a0 mm/page_alloc.c:3491
> > __alloc_pages_nodemask+0x4f3/0xde0 mm/page_alloc.c:4529
> > __alloc_pages include/linux/gfp.h:473 [inline]
> > alloc_page_interleave+0x25/0x1c0 mm/mempolicy.c:1988
> > alloc_pages_current+0x1bf/0x210 mm/mempolicy.c:2104
> > alloc_pages include/linux/gfp.h:509 [inline]
> > depot_save_stack+0x3f1/0x470 lib/stackdepot.c:260
> > save_stack+0xa9/0xd0 mm/kasan/common.c:79
> > set_track mm/kasan/common.c:85 [inline]
> > kasan_kmalloc+0xcb/0xd0 mm/kasan/common.c:482
> > kasan_slab_alloc+0x12/0x20 mm/kasan/common.c:397
> > kmem_cache_alloc+0x130/0x730 mm/slab.c:3541
> > kmem_cache_zalloc include/linux/slab.h:731 [inline]
> > fill_pool lib/debugobjects.c:134 [inline]
> > __debug_object_init+0xbb8/0x1290 lib/debugobjects.c:379
> > debug_object_init lib/debugobjects.c:431 [inline]
> > debug_object_activate+0x323/0x600 lib/debugobjects.c:512
> > debug_timer_activate kernel/time/timer.c:708 [inline]
> > debug_activate kernel/time/timer.c:763 [inline]
> > __mod_timer kernel/time/timer.c:1040 [inline]
> > mod_timer kernel/time/timer.c:1101 [inline]
> > add_timer+0x50e/0x1490 kernel/time/timer.c:1137
> > __queue_delayed_work+0x249/0x380 kernel/workqueue.c:1533
> > queue_delayed_work_on+0x1a2/0x1f0 kernel/workqueue.c:1558
> > queue_delayed_work include/linux/workqueue.h:527 [inline]
> > schedule_delayed_work include/linux/workqueue.h:628 [inline]
> > start_dirtytime_writeback+0x4e/0x53 fs/fs-writeback.c:2043
> > do_one_initcall+0x145/0x957 init/main.c:889
> > do_initcall_level init/main.c:957 [inline]
> > do_initcalls init/main.c:965 [inline]
> > do_basic_setup init/main.c:983 [inline]
> > kernel_init_freeable+0x4c1/0x5af init/main.c:1136
> > kernel_init+0x11/0x1ae init/main.c:1056
> > ret_from_fork+0x3a/0x50 arch/x86/entry/entry_64.S:352
> >
> > -> #3 (&base->lock){-.-.}:

However I really, _really_ hate that dependency. We really should not
be doing memory allocations under rq->lock.

We seem to avoid this for the existing hrtimer usage, because of
hrtimer_init() doing: debug_init() -> debug_hrtimer_init() ->
debug_object_init().

But that isn't done for the (PSI) schedule_delayed_work() thing for some
raisin; even though: group_init() does INIT_DELAYED_WORK() ->
__INIT_DELAYED_WORK() -> __init_timer() -> init_timer_key() ->
debug_init() -> debug_timer_init() -> debug_object_init().

But _somehow_ that isn't doing it.

Now debug_object_activate() has this case:

	if (descr->is_static_object && descr->is_static_object(addr)) {
		debug_object_init()

which does a debug_object_init() for static allocations, which brings
us to:

	static DEFINE_PER_CPU(struct psi_group_cpu, system_group_pcpu);
	static struct psi_group psi_system = {

But that _should_ get initialized by psi_init(), which is called from
sched_init() which _should_ be waaay before do_basic_setup().
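
To make the concern concrete, here is a minimal user-space sketch of the
static-object fallback described above. It is not kernel code: struct
tracker, model_object_init(), model_object_activate() and the in_critical
flag are all hypothetical names, a counter stands in for the real page
allocation, and in_critical stands in for holding a lock such as rq->lock.
It only shows why an object that was never registered ends up being
initialized (and so allocated for) at activate time.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model; none of these names exist in the kernel. */
#define MAX_TRACKED 16

struct tracker {
	const void *objs[MAX_TRACKED];   /* addresses already tracked */
	size_t nr;                       /* number of tracked objects */
	unsigned int allocs;             /* simulated allocations so far */
	bool in_critical;                /* models "a lock is held" */
	unsigned int allocs_in_critical; /* allocations made while "locked" */
};

static bool tracker_lookup(const struct tracker *t, const void *addr)
{
	for (size_t i = 0; i < t->nr; i++)
		if (t->objs[i] == addr)
			return true;
	return false;
}

/* Models debug_object_init(): start tracking addr, which costs memory. */
static void model_object_init(struct tracker *t, const void *addr)
{
	if (tracker_lookup(t, addr))
		return;                  /* already tracked, nothing to allocate */

	assert(t->nr < MAX_TRACKED);     /* the sketch has a fixed-size table */
	t->allocs++;
	if (t->in_critical)
		t->allocs_in_critical++; /* the case we want to avoid */
	t->objs[t->nr++] = addr;
}

/*
 * Models the static-object case in debug_object_activate(): an object
 * that is not tracked yet but is static gets initialized on the spot.
 */
static void model_object_activate(struct tracker *t, const void *addr,
				  bool is_static)
{
	if (!tracker_lookup(t, addr) && is_static)
		model_object_init(t, addr);
}
```

In this model, calling model_object_init() on an object before setting
in_critical leaves allocs_in_critical at zero when the object is later
activated. Activating a static object that was never initialized bumps
the counter instead, which is the analogue of the allocation showing up
in the chain above.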

Something goes wobbly.. but I'm not seeing it.