Re: futex: clarification needed with drop_futex_key_refs and memory barriers

From: Petros Koutoupis
Date: Wed Mar 30 2016 - 21:53:33 EST


Peter,

Thank you very much for your input. Please read my comments below...

On Tue, 2016-03-29 at 11:50 +0200, Peter Zijlstra wrote:
> On Sun, Mar 27, 2016 at 08:25:48AM +0200, Mike Galbraith wrote:
> > (futex ordering pop-flare)
> >
> > On Sat, 2016-03-26 at 10:56 -0500, Petros Koutoupis wrote:
> > > I stumbled on an interesting scenario that I am unable to fully explain, and I
> > > was hoping to get some other opinions on why this would or wouldn't work.
> > >
> > > In recent testing on a 48-core Haswell arch server, our multi-threaded user space
> > > application was utilizing 60% to 100% more CPU than on our smaller 24-core servers
> > > (running an identical load). After spending a considerable amount of time analyzing
> > > stack dumps and straces, it became apparent that the threads showing the higher
> > > CPU utilization were off in futex land.
>
> perf should be able to tell you that pretty quickly, no?
>

In an ideal scenario perf would be the right tool, but in our custom and
quite limited environment it isn't available at the moment. We are
currently working to correct this.

> A question about your workload; are you typically 'stuck' in
> futex_wait() or some of the requeue stuff (which is more complex). Let
> me assume regular futex_wait() for now.
>

It is futex_wait().

> > > Shortly afterward I stumbled on commit 76835b0ebf8a7fe85beb03c75121419a7dec52f0
> > > (futex: Ensure get_futex_key_refs() always implies a barrier), which addressed the
> > > handling of private futexes and prevented a race condition by completing the
> > > function with a memory barrier. Now, I fully understand why this patch was implemented:
> > > to have a memory barrier before checking the "waiters." It makes sense.
>
> Right; well, a barrier isn't _before_ something; it's between two things.
> And I think we're ordering the user-space store to the futex value
> against the waiters load, but the code could do with a comment here.
>
> And this should be totally irrelevant for x86 because a CPL change
> already implies a full barrier.
>
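
To make sure I am following the ordering you are describing, below is my
mental model written as a user-space analogue with C11 atomics. This is
purely illustrative: wait_side()/wake_side() are my own names, and the
kernel of course uses smp_mb() and the per-bucket hb->waiters count
rather than this exact code.

#include <stdatomic.h>
#include <stdbool.h>

static atomic_int futex_val;	/* stands in for the user-space futex word */
static atomic_int waiters;	/* stands in for hb->waiters */

/* waiter: bump the count, then (fully ordered) re-check the value */
static bool wait_side(int expected)
{
	atomic_fetch_add_explicit(&waiters, 1, memory_order_relaxed);
	atomic_thread_fence(memory_order_seq_cst);	/* barrier (A) */
	if (atomic_load_explicit(&futex_val, memory_order_relaxed) != expected) {
		atomic_fetch_sub_explicit(&waiters, 1, memory_order_relaxed);
		return false;	/* value already changed; do not sleep */
	}
	/* ... would block here until woken ... */
	return true;
}

/* waker: store the new value, then (fully ordered) check the count */
static bool wake_side(int new_val)
{
	atomic_store_explicit(&futex_val, new_val, memory_order_relaxed);
	atomic_thread_fence(memory_order_seq_cst);	/* barrier (B) */
	if (atomic_load_explicit(&waiters, memory_order_relaxed) == 0)
		return false;	/* fast path: no waiters, skip the bucket lock */
	/* ... would take hb->lock and wake a waiter here ... */
	return true;
}

With both fences in place, at least one side is guaranteed to observe
the other's store. Drop either fence and the waiter can read a stale
futex value and go to sleep while the waker reads a stale zero count
and skips the wakeup: a lost wakeup.
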
> > > What doesn't
> > > make sense (so far) is that when I apply the same barrier to the drop counterpart,
> > > drop_futex_key_refs(), the problem goes away. See the change and my notes below.
> > >
> > >
> > > --- linux/kernel/futex.c.orig 2016-03-25 19:45:08.169563263 -0500
> > > +++ linux/kernel/futex.c 2016-03-25 19:46:06.901562211 -0500
> > > @@ -438,11 +438,13 @@ static void drop_futex_key_refs(union fu
> > >
> > > switch (key->both.offset & (FUT_OFF_INODE|FUT_OFF_MMSHARED)) {
> > > case FUT_OFF_INODE:
> > > - iput(key->shared.inode);
> > > + iput(key->shared.inode); /* implies smp_mb(); (B) */
> > > break;
> > > case FUT_OFF_MMSHARED:
> > > - mmdrop(key->private.mm);
> > > + mmdrop(key->private.mm); /* implies smp_mb(); (B) */
> > > break;
> > > + default:
> > > + smp_mb(); /* explicit smp_mb(); (B) */
> > > }
> > > }
>
> So the patch makes sense from a symmetry POV, but I'm having a hard time
> explaining why this would make any difference to your workload.
>
> The best I can come up with is that the explicit MFENCE (as opposed to a
> LOCK prefixed instruction) causes a store-buffer flush and hence the
> waker more often sees !waiters and therefore tries to acquire the bucket
> lock less.
>
> But this is not a correctness (nor ordering) issue; it is purely an
> architectural side-effect. Furthermore, some proposed changes:
>
> http://marc.info/?l=linux-kernel&m=145400059704564&w=2
>
> might change this side-effect.
>

Has there been an update to this patch? From what I can see, the
conversation ended in late January, and the change has not landed in
mainline.
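
For context, and to check my reading of your store-buffer argument: the
fast path I believe is in play is the waiters peek at the top of
futex_wake(). Paraphrased from the futex.c of this era (a sketch, not
verbatim):

static inline int hb_waiters_pending(struct futex_hash_bucket *hb)
{
#ifdef CONFIG_SMP
	return atomic_read(&hb->waiters);
#else
	return 1;
#endif
}

	/* in futex_wake(), before touching the bucket lock: */
	if (!hb_waiters_pending(hb))
		goto out_put_key;	/* no waiters: skip hb->lock entirely */

	spin_lock(&hb->lock);
	/* ... walk the plist and wake any matching waiters ... */

If the preceding user-space store is still sitting in the store buffer,
then an MFENCE versus a LOCK-prefixed instruction could plausibly change
how often that atomic_read() sees !waiters, and with it how often the
waker skips taking hb->lock at all.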

>
> In any case; the below (completely irrelevant patch for you) is
> something I would propose. It gives hb_waiters_dec() RELEASE-like
> semantics and ensures it cannot creep into the lock sections it's
> typically behind. Although strictly speaking I think it being inside
> that lock region is sufficient.
>
> It also re-orders the increment in requeue to happen before we add to
> the list (as is the proper order) and removes a totally bogus comment;
> spin_lock() does _NOT_ imply a full barrier.
>
> ---
> kernel/futex.c | 5 +++--
> 1 file changed, 3 insertions(+), 2 deletions(-)
>
> diff --git a/kernel/futex.c b/kernel/futex.c
> index 5d6ce6413ef1..615f277f313b 100644
> --- a/kernel/futex.c
> +++ b/kernel/futex.c
> @@ -360,6 +360,7 @@ static inline void hb_waiters_inc(struct futex_hash_bucket *hb)
> static inline void hb_waiters_dec(struct futex_hash_bucket *hb)
> {
> #ifdef CONFIG_SMP
> + smp_mb__before_atomic();
> atomic_dec(&hb->waiters);
> #endif
> }
> @@ -1442,8 +1443,8 @@ void requeue_futex(struct futex_q *q, struct futex_hash_bucket *hb1,
> if (likely(&hb1->chain != &hb2->chain)) {
> plist_del(&q->list, &hb1->chain);
> hb_waiters_dec(hb1);
> - plist_add(&q->list, &hb2->chain);
> hb_waiters_inc(hb2);
> + plist_add(&q->list, &hb2->chain);
> q->lock_ptr = &hb2->lock;
> }
> get_futex_key_refs(key2);
> @@ -1864,7 +1865,7 @@ static inline struct futex_hash_bucket *queue_lock(struct futex_q *q)
>
> q->lock_ptr = &hb->lock;
>
> - spin_lock(&hb->lock); /* implies MB (A) */
> + spin_lock(&hb->lock);
> return hb;
> }
>

Your adjustments here make complete sense. Are you planning to submit
this in the near future?
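
For the archives, here is my reading of why the reordering half of your
patch matters; the annotations are mine, not from your changelog:

	plist_del(&q->list, &hb1->chain);
	hb_waiters_dec(hb1);		/* RELEASE-like with the new
					 * smp_mb__before_atomic() above */
	hb_waiters_inc(hb2);		/* publish the count first ... */
	plist_add(&q->list, &hb2->chain);	/* ... then the list entry */
	q->lock_ptr = &hb2->lock;

With the increment ahead of the insertion, any waker that can see the
requeued entry on hb2's chain also sees hb2->waiters != 0, so the
lockless hb_waiters_pending() fast path can never skip a bucket that
already holds a queued waiter. The old order left a window where the
entry was visible while the count still read zero.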