Re: BUG: spinlock bad magic on CPU#0, migration/0/9

From: Davidlohr Bueso
Date: Thu Feb 12 2015 - 14:59:39 EST


On Thu, 2015-02-12 at 20:10 +0100, Nicholas Mc Guire wrote:
> On Thu, 12 Feb 2015, Oleg Nesterov wrote:
>
> > On 02/12, Oleg Nesterov wrote:
> > > On 02/11, Davidlohr Bueso wrote:
> > > >
> > > > On Wed, 2015-02-11 at 16:34 -0800, Paul E. McKenney wrote:
> > > > > Hello!
> > > > >
> > > > > Did an earlier-than-usual port of v3.21 patches to post-v3.19, and
> > > > > hit the following on x86_64. This happened after about 15 minutes of
> > > > > rcutorture. In contrast, I have been doing successful 15-hour runs
> > > > > on v3.19. I will check reproducibility and try to narrow it down.
> > > > > Might this be a duplicate of the bug that Raghavendra posted a fix for?
> > > > >
> > > > > Anyway, this was on 3e8c04eb1174 (Merge branch 'for-3.20' of
> > > > > git://git.kernel.org/pub/scm/linux/kernel/git/tj/libata).
> > > > >
> > > > > [ 837.287011] BUG: spinlock bad magic on CPU#0, migration/0/9
> > > > > [ 837.287013] lock: 0xffff88001ea0fe80, .magic: ffffffff, .owner: g?<81>????/0, .owner_cpu: -42
> > > > > [ 837.287013] CPU: 0 PID: 9 Comm: migration/0 Not tainted 3.19.0+ #1
> > > > > [ 837.287013] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
> > > > > [ 837.287013] ffff88001ea0fe80 ffff88001ea0bc78 ffffffff818f6f4b ffffffff810a5a51
> > > > > [ 837.287013] ffffffff81e500e0 ffff88001ea0bc98 ffffffff818f3755 ffff88001ea0fe80
> > > > > [ 837.287013] ffffffff81ca4396 ffff88001ea0bcb8 ffffffff818f377b ffff88001ea0fe80
> > > > > [ 837.287013] Call Trace:
> > > > > [ 837.287013] [<ffffffff818f6f4b>] dump_stack+0x45/0x57
> > > > > [ 837.287013] [<ffffffff810a5a51>] ? console_unlock+0x1f1/0x4c0
> > > > > [ 837.287013] [<ffffffff818f3755>] spin_dump+0x8b/0x90
> > > > > [ 837.287013] [<ffffffff818f377b>] spin_bug+0x21/0x26
> > > > > [ 837.287013] [<ffffffff8109923c>] do_raw_spin_unlock+0x5c/0xa0
> > > > > [ 837.287013] [<ffffffff81902587>] _raw_spin_unlock_irqrestore+0x27/0x50
> > > > > [ 837.287013] [<ffffffff8108f0a1>] complete+0x41/0x50
> > > >
> > > > We did have some recent changes in completions:
> > > >
> > > > 7c34e318 (sched/completion: Add lock-free checking of the blocking case)
> > > > de30ec47 (sched/completion: Remove unnecessary ->wait.lock serialization when reading completion state)
> > > >
> > > > The second one being more related (although both appear to make sense).
> > > > Perhaps some subtle implication in the completion_done side that
> > > > disappeared with the spinlock?
> > >
> > > At first glance both changes look suspicious.
> >
> > No, sorry, only the 2nd one.
> >
> > > Unless we at least document how
> > > you can use these helpers.
> > >
> > > Consider this code:
> > >
> > > 	void xxx(void)
> > > 	{
> > > 		struct completion c;
> > >
> > > 		init_completion(&c);
> > >
> > > 		expose_this_completion(&c);
> > >
> > > 		while (!completion_done(&c))
> > > 			schedule_timeout_uninterruptible(1);
> > > 	}
>
> But that would not break due to the change - even if completion_done() had a
> problem - completion_done() is not consuming x->done, it is only checking it?

There is, for instance, contention to consider. When this happens:

CPU0                                    CPU1

complete(x)
  spin_lock_irqsave()
                                        completion_done(x)
                                          load x->done;
  x->done++;
  wakeup

Before the change, completion_done() would respect the ordering of the
completion check against the spinlock ticket, so you wouldn't run into
this issue. The impact obviously depends on how the waiting for the
completion is done, for instance issuing more sleeps when the lock
becomes very contended.
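
To make the interaction concrete, here is a rough sketch (not the actual
code under test) that pairs Oleg's xxx() waiter above with the completer
side. It assumes the post-de30ec47 completion_done(), which only loads
x->done without taking ->wait.lock; expose_this_completion() is just a
placeholder for however &c gets handed to the other CPU:

	/* Waiter side, completion on the stack (Oleg's xxx() above): */
	void xxx(void)
	{
		struct completion c;

		init_completion(&c);
		expose_this_completion(&c);	/* hand &c to the other CPU */

		/*
		 * The lockless completion_done() only loads c.done, so it is
		 * no longer ordered against ->wait.lock held inside complete().
		 */
		while (!completion_done(&c))
			schedule_timeout_uninterruptible(1);

		/* c, including its embedded ->wait.lock, is gone on return. */
	}

	/* Completer side, on another CPU; complete() is roughly: */
	void complete(struct completion *x)
	{
		unsigned long flags;

		spin_lock_irqsave(&x->wait.lock, flags);
		x->done++;
		__wake_up_locked(&x->wait, TASK_NORMAL, 1);
		spin_unlock_irqrestore(&x->wait.lock, flags);
		/*
		 * If the waiter already observed x->done != 0 and xxx()
		 * returned, this unlock touches a destroyed lock, which
		 * would be consistent with the complete() ->
		 * _raw_spin_unlock_irqrestore() path in the splat above.
		 */
	}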
