Re: [PATCH -v4 5/7] locking, arch: Update spin_unlock_wait()

From: Will Deacon
Date: Fri Jun 03 2016 - 08:47:28 EST


Hi Peter,

On Thu, Jun 02, 2016 at 11:51:19PM +0200, Peter Zijlstra wrote:
> On Thu, Jun 02, 2016 at 06:57:00PM +0100, Will Deacon wrote:
> > > +++ b/include/asm-generic/qspinlock.h
> > > @@ -28,30 +28,13 @@
> > >   */
> > >  static __always_inline int queued_spin_is_locked(struct qspinlock *lock)
> > >  {
> > > +        /*
> > > +         * See queued_spin_unlock_wait().
> > >           *
> > > +         * Any !0 state indicates it is locked, even if _Q_LOCKED_VAL
> > > +         * isn't immediately observable.
> > >           */
> > > -        smp_mb();
> > > +        return !!atomic_read(&lock->val);
> > >  }
> >
> > I'm failing to keep up here :(
> >
> > The fast-path code in queued_spin_lock is just an atomic_cmpxchg_acquire.
> > If that's built out of LL/SC instructions, then why don't we need a barrier
> > here in queued_spin_is_locked?
> >
> > Or is the decision now that only spin_unlock_wait is required to enforce
> > this ordering?
>
> (warning: long, somewhat rambling, email)

Thanks for taking the time to write this up. Comments inline.

> You're talking about the smp_mb() that went missing?

Right -- I think you still need it.

> So that wasn't the reason that smp_mb() existed... but that makes the
> atomic_foo_acquire() things somewhat hard to use, because I don't think
> we want to unconditionally put the smp_mb() in there just in case.
>
> Now, the normal atomic_foo_acquire() stuff uses smp_mb() as per
> smp_mb__after_atomic(); it's just ARM64 and PPC that go all 'funny' and
> need this extra barrier. Blergh. So let's shelve this issue for a bit.

Hmm... I certainly plan to get qspinlock up and running for arm64 in the
near future, so I'm not keen on shelving it for very long.

> Let me write some text to hopefully explain where it did come from and
> why I now removed it.
>
>
> So the spin_is_locked() correctness issue comes from something like:
>
>     CPU0                            CPU1
>
>     global_lock();                  local_lock(i)
>       spin_lock(&G)                   spin_lock(&L[i])
>       for (i)                         if (!spin_is_locked(&G)) {
>         spin_unlock_wait(&L[i]);        smp_acquire__after_ctrl_dep();
>                                         return;
>                                       }
>                                       /* deal with fail */
>
> Where it is important CPU1 sees G locked or CPU0 sees L[i] locked such
> that there is exclusion between the two critical sections.

Yes, and there's also a version of this where CPU0 is using spin_is_locked
(see 51d7d5205d338 "powerpc: Add smp_mb() to arch_spin_is_locked()").

> The load from spin_is_locked(&G) is constrained by the ACQUIRE from
> spin_lock(&L[i]), and similarly the load(s) from spin_unlock_wait(&L[i])
> are constrained by the ACQUIRE from spin_lock(&G).
>
> Similarly, later stuff is constrained by the ACQUIRE from CTRL+RMB.
>
> Given a simple (SC) test-and-set spinlock the above is fairly
> straightforward and 'correct', right?

Well, the same issue that you want to shelve can manifest here with
test-and-set locks too, allowing the spin_is_locked(&G) to be speculated
before the spin_lock(&L[i]) has finished taking the lock on CPU1.

Even ignoring that, I'm not convinced this would work for test-and-set
locks without barriers in unlock_wait and is_locked. For example, a
cut-down version of your test looks like:

  CPU0:         CPU1:
  LOCK x        LOCK y
  Read y        Read x

and you don't want the reads to both return "unlocked".

Even on x86, I think you need a fence here:

X86 lock
{
}
 P0                | P1                ;
 MOV EAX,$1        | MOV EAX,$1        ;
 LOCK XCHG [x],EAX | LOCK XCHG [y],EAX ;
 MOV EBX,[y]       | MOV EBX,[x]       ;
exists
(0:EAX=0 /\ 0:EBX=0 /\ 1:EAX=0 /\ 1:EBX=0)

is permitted by herd.

> Now, the 'problem' with qspinlock is that one possible acquire path goes
> like (there are more, but this is the easiest):
>
> smp_cond_acquire(!(atomic_read(&lock->val) & _Q_LOCKED_MASK));
> clear_pending_set_locked(lock);
>
> And one possible implementation of clear_pending_set_locked() is:
>
> WRITE_ONCE(l->locked_pending, _Q_LOCKED_VAL);
>
> IOW, we load-acquire the locked byte until it's cleared, at which point
> we know the pending byte to be 1. Then we consider the lock owned by us
> and issue a regular unordered store to flip the pending and locked
> bytes.
>
> Normal mutual exclusion is fine with this, no new pending can happen
> until this store becomes visible at which time the locked byte is
> visibly taken.
>
> This unordered store however, can be delayed (store buffer) such that
> the loads from spin_unlock_wait/spin_is_locked can pass up before it
> (even on TSO arches).

Right, and this is surprisingly similar to the LL/SC problem imo.

> _IF_ spin_unlock_wait/spin_is_locked only look at the locked byte, this
> is a problem because at that point the crossed store-load pattern
> becomes uncrossed and we lose our guarantee. That is, what used to be:
>
>     [S] G.locked = 1              [S] L[i].locked = 1
>     [MB]                          [MB]
>     [L] L[i].locked               [L] G.locked
>
> becomes:
>
>     [S] G.locked = 1              [S] L[i].locked = 1
>     [L] L[i].locked               [L] G.locked
>
> Which we can reorder at will and bad things follow.
>
> The previous fix for this was to include an smp_mb() in both
> spin_is_locked() and spin_unlock_wait() to restore that ordering.
>
>
> So at this point spin_unlock_wait() looks like:
>
>     smp_mb();
>     while (atomic_read(&lock->val) & _Q_LOCKED_MASK)
>             cpu_relax();

I have something similar queued for arm64's ticket locks.

> But for spin_unlock_wait() there is a second correctness issue, namely:
>
> CPU0 CPU1
>
> flag = set;
> smp_mb(); spin_lock(&l)
> spin_unlock_wait(&l); if (flag)
> /* bail */
>
> /* add to lockless list */
> spin_unlock(&l);
>
> /* iterate lockless list */
>
>
> Which ensures that CPU1 will stop adding bits to the list and CPU0 will
> observe the last entry on the list (if spin_unlock_wait() had ACQUIRE
> semantics etc..)
>
> This however, is still broken.. nothing ensures CPU0 sees l.locked
> before CPU1 tests flag.

Yup.

> So while we fixed the first correctness case (global / local locking as
> employed by ipc/sem and nf_conntrack) this is still very much broken.
>
> My patch today rewrites spin_unlock_wait() and spin_is_locked() to rely
> on more information to (hopefully -- I really need sleep) fix both.
>
> The primary observation is that even though the l.locked store is
> delayed, there has been a prior atomic operation on the lock word to
> register the contending lock (in the above scenario, set the pending
> byte, in the other paths, queue onto the tail word).
>
> This means that any load passing the .locked byte store, must at least
> observe that state.

That's what I'm not sure about. Just because there was an atomic operation
writing that state, I don't think it means that it's visible to a normal
load.

Will