Re: [rfc][patch 4a/6] brlock: "fast" brlocks

From: Nick Piggin
Date: Thu Oct 15 2009 - 07:27:42 EST


On Thu, Oct 15, 2009 at 01:05:21PM +0200, Peter Zijlstra wrote:
> On Thu, 2009-10-15 at 08:58 +0200, Nick Piggin wrote:
> > [Not for merge. Stop reading if you're not interested in locking minutiae.]
> >
> > OK, this is untested but I think the theory is right. Basically it is taking
> > the idea from Dave M's cool brlock optimisation stuff with one further
> > optimisation in that the read locker does not check the spinlock but
> > rather we keep another wlocked variable together in the same cacheline per
> > CPU, so the read locker only has to touch one cacheline rather than 2.
> >
> > This actually will reduce the number of atomics by 2 per path lookup,
> > however we have an smp_mb() there now which is really nasty on some
> > architectures (like ia64 and ppc64), and not that nice on x86 either.
> > We can probably do something interesting on ia64 and ppc64 so that we
> > take advantage of the fact rlocked and wlocked are in the same cacheline
> > so cache coherency (rather than memory consistency) should always provide
> > a strict ordering there. We still do need an acquire barrier -- but it is
> > a much nicer lwsync or st.acq on ppc and ia64.
> >
> > But: is the avoidance of the atomic RMW a big win? On x86 cores I've tested
> > IIRC mfence is about as costly as a locked instruction which includes the
> > mfence...
> >
> > So long story short: it might be a small win but it is going to be very
> > arch specific and will require arch specific code to do the barriers and
> > things. The generic spinlock brlock isn't bad at all, so I'll just post
> > this as a curiosity for the time being.
> >
>
> fwiw, I rather like this implementation better, and adding lockdep
> annotations to this one shouldn't be hard.

OK, although there is nothing preventing us from using raw spinlocks
and a new lockdep object to annotate the other one, right?

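The problem with this one is that, firstly, it is not suitable for a
generic implementation (see XXX: we technically need smp_mb in the
unlock rather than smp_wmb, because smp_wmb only orders stores against
other stores, so the unlock store could become visible before loads in
the critical section have completed. I don't know if any actual CPUs do
this, but several architectures do allow stores to pass loads).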

So we _really_ need to do proper acquire and release barriers both
here in the unlock and in the lock as well for it to be competitive
with the spinlocks.

smp_mb is a nasty hammer, especially as a release barrier, because it
prevents later loads from being started early until the unlock store is
visible. As an acquire barrier it is theoretically not so bad, but in
practice, on powerpc for example, it is done with an instruction that
apparently has to go out to the interconnect.
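
To make the barrier placement concrete, here is a rough userspace sketch
using C11 atomics instead of the kernel primitives. It is not the code
from the patch. All names (struct brlock, NR_SLOTS, br_read_trylock and
so on) are hypothetical, and it assumes at most one reader per slot,
writers serialised by the caller, and trylock variants rather than
spinning. The seq_cst fence stands in for smp_mb in the lock paths, and
the release store in the read unlock is the part where smp_wmb would be
too weak.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define NR_SLOTS 4	/* hypothetical stand-in for the number of CPUs */

/* rlocked and wlocked for one "CPU" share a single 64-byte cacheline. */
struct brlock_slot {
	_Alignas(64) atomic_int rlocked;
	atomic_int wlocked;
};

struct brlock {
	struct brlock_slot slot[NR_SLOTS];
};

static void br_init(struct brlock *l)
{
	for (int i = 0; i < NR_SLOTS; i++) {
		atomic_init(&l->slot[i].rlocked, 0);
		atomic_init(&l->slot[i].wlocked, 0);
	}
}

static void br_read_unlock(struct brlock *l, int cpu)
{
	assert(cpu >= 0 && cpu < NR_SLOTS);
	/*
	 * Release: loads and stores of the critical section must be done
	 * before rlocked is seen as 0. A store-store barrier alone would
	 * not order the earlier loads.
	 */
	atomic_store_explicit(&l->slot[cpu].rlocked, 0, memory_order_release);
}

static bool br_read_trylock(struct brlock *l, int cpu)
{
	assert(cpu >= 0 && cpu < NR_SLOTS);
	struct brlock_slot *s = &l->slot[cpu];

	atomic_store_explicit(&s->rlocked, 1, memory_order_relaxed);
	/* Full fence (the smp_mb): order the store above before the load. */
	atomic_thread_fence(memory_order_seq_cst);
	/* Acquire: the critical section cannot move above this load. */
	if (atomic_load_explicit(&s->wlocked, memory_order_acquire)) {
		br_read_unlock(l, cpu);	/* writer active, back out */
		return false;
	}
	return true;
}

static void br_write_unlock(struct brlock *l)
{
	for (int i = 0; i < NR_SLOTS; i++)
		atomic_store_explicit(&l->slot[i].wlocked, 0,
				      memory_order_release);
}

/* Assumes the caller already excludes other writers. */
static bool br_write_trylock(struct brlock *l)
{
	for (int i = 0; i < NR_SLOTS; i++)
		atomic_store_explicit(&l->slot[i].wlocked, 1,
				      memory_order_relaxed);
	/* Full fence: all wlocked stores before any rlocked load. */
	atomic_thread_fence(memory_order_seq_cst);
	for (int i = 0; i < NR_SLOTS; i++) {
		if (atomic_load_explicit(&l->slot[i].rlocked,
					 memory_order_acquire)) {
			br_write_unlock(l);	/* reader active, back out */
			return false;
		}
	}
	return true;
}
```

The read side touches only its own slot and uses no atomic RMW, which is
the point of the scheme. A blocking version would spin where the sketch
backs out, and a reader and writer that collide can both fail their
trylock here.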
