Re: sched: memory corruption on completing completions

From: Linus Torvalds
Date: Wed Feb 04 2015 - 19:17:00 EST


On Wed, Feb 4, 2015 at 3:24 PM, Sasha Levin <sasha.levin@xxxxxxxxxx> wrote:
>
> I now have a theory for why it happens:
>
> Thread A                                 Thread B
> ----------------------------------------------------------
>
> [Enter function]
> DECLARE_COMPLETION_ONSTACK(x)
> wait_for_completion(x)
>                                          complete(x)
>                                          [In complete(x):]
>                                          spin_lock_irqsave(&x->wait.lock, flags);
>                                          x->done++;
>                                          __wake_up_locked(&x->wait, TASK_NORMAL, 1);
> [Done waiting, wakes up]
> [Exit function]
>                                          spin_unlock_irqrestore(&x->wait.lock, flags);
>
> So the spin_unlock_irqrestore() at the end of complete() would proceed to
> corrupt the stack of thread A.
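
To make the lifetime issue concrete, here is a minimal sketch of the
on-stack completion pattern being described (my sketch, not code from the
report; start_worker() is a hypothetical stand-in for whatever hands the
completion over to thread B):

static int waiter_thread_a(void)
{
	DECLARE_COMPLETION_ONSTACK(x);

	start_worker(&x);		/* thread B will eventually call complete(&x) */
	wait_for_completion(&x);	/* thread A wakes up here */
	return 0;			/* the stack frame holding x is released on return */
}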

We have indeed had bugs like that, but it *shouldn't* be the case any
more. The hard rule for spinlocks is:

- the last thing *unlock* does is to release the lock with a single
store. After that has happened, it will not touch it, and before that
has happened, nobody else can get the spinlock

which means that you cannot actually get the case you are talking
about (because the "done waiting, wakes up" in thread A happens with
the spinlock held).
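
As an illustration of that rule, using a toy lock type of my own (not the
kernel's arch_spinlock_t):

struct toy_spinlock {
	int locked;
};

static inline void toy_unlock(struct toy_spinlock *lock)
{
	/*
	 * The releasing store is the last access to *lock: once it is
	 * visible, another CPU can take the lock, finish its work and
	 * free the memory the lock lives in.
	 */
	smp_store_release(&lock->locked, 0);
}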

That said, as mentioned, we have had bugs in spinlock debugging code
in particular, where the last unlock does other things after it
releases the low-level lock. And you are clearly not using normal
spinlocks, since your error trace has this:

do_raw_spin_unlock+0x417/0x4f0

which means that "do_raw_spin_unlock() is 1250+ bytes in size. A
"normal" do_raw_spin_unlock()" is inlined because it's a single store
operation.

Looks like the debug version of do_raw_spin_unlock() is slightly
larger than that ;) Anyway, apparently that version of
do_raw_spin_unlock isn't just huge, it's also buggy.. Although I
thought we fixed that bug some time ago, and looking at the code it
does

void do_raw_spin_unlock(raw_spinlock_t *lock)
{
	debug_spin_unlock(lock);
	arch_spin_unlock(&lock->raw_lock);
}

so it *looks* ok in the generic code.
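
For reference, the debug hook itself does all of its checking and
bookkeeping *before* the arch-level release (roughly the following,
quoting lib/spinlock_debug.c from memory), so the generic wrapper keeps
to the rule above:

static inline void debug_spin_unlock(raw_spinlock_t *lock)
{
	SPIN_BUG_ON(lock->magic != SPINLOCK_MAGIC, lock, "bad magic");
	SPIN_BUG_ON(!raw_spin_is_locked(lock), lock, "already unlocked");
	SPIN_BUG_ON(lock->owner != current, lock, "wrong owner");
	SPIN_BUG_ON(lock->owner_cpu != raw_smp_processor_id(),
						lock, "wrong CPU");
	lock->owner = SPINLOCK_OWNER_INIT;
	lock->owner_cpu = -1;
}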

So off to look at the arch code...

And looking at the arch version, I think the paravirtualized code is crap.

It does:

	prev = *lock;
	add_smp(&lock->tickets.head, TICKET_LOCK_INC);

	/* add_smp() is a full mb() */

	if (unlikely(lock->tickets.tail & TICKET_SLOWPATH_FLAG))
		__ticket_unlock_slowpath(lock, prev);


which is *exactly* the kind of things you cannot do with spinlocks,
because after you've done the "add_smp()" and released the spinlock
for the fast-path, you can't access the spinlock any more. Exactly
because a fast-path lock might come in, and release the whole data
structure.
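
For contrast, here is a sketch of the shape the unlock would have to take
to respect the rule (my sketch, not the fix that actually went in; it
assumes the slowpath flag can be made to live in the same word that the
xadd() on the head ticket returns, that the waiter kick uses the lock
pointer only as a lookup key without dereferencing it, and
__ticket_unlock_kick_waiter() is a hypothetical helper):

static inline void ticket_unlock_sketch(arch_spinlock_t *lock)
{
	/*
	 * Release and snapshot in one atomic op: xadd() returns the old
	 * value, so the slowpath test below needs no further access to
	 * *lock after the lock has been released.
	 */
	__ticket_t old = xadd(&lock->tickets.head, TICKET_LOCK_INC);

	if (unlikely(old & TICKET_SLOWPATH_FLAG))
		__ticket_unlock_kick_waiter(lock, old & ~TICKET_SLOWPATH_FLAG);
}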

As usual, the paravirt code is a horribly buggy heap of crud. Film at 11.

Why did I think we had this bug but already fixed it? Maybe it's one
of those things that Waiman fixed in his long delayed qspinlock
series? Waiman? Or maybe I just remember the fixes where we changed
from a mutex to a spinlock, which fixed a very similar case for
non-paravirtualized cases.

Linus