Re: [PATCH 0/3] 64-bit futexes: Intro

From: Nick Piggin
Date: Wed Jun 04 2008 - 21:45:43 EST


On Wed, Jun 04, 2008 at 12:57:13PM -0700, Linus Torvalds wrote:
>
>
> On Tue, 3 Jun 2008, Nick Piggin wrote:
> >
> > I think I optimised our unlock_page in a way that it can do a
> > non-atomic unlock in the fastpath (no waiters) using 2 bits. In
> > practice it was still atomic but only because other page flags
> > operations could operate on ->flags at the same time.
>
> I'd be *very* nervous about this.

Heh ;) Well I'm not actually trying to do it in Linux (yet).


> > We don't require any load/store barrier in the unlock_page fastpath
> > because the bits are in the same word, so cache coherency gives us a
> > sequential ordering anyway.
>
> Yes and no.
>
> Yes, the bits are in the same word, so cache coherency guarantees a lot.
>
> HOWEVER. If you do the sub-word write using a regular store, you are now
> invoking the _one_ non-coherent part of the x86 memory pipeline: the store
> buffer. Normal stores can (and will) be forwarded to subsequent loads from
> the store buffer, and they are not strongly ordered wrt cache coherency
> while they are buffered.
>
> IOW, on x86, loads are ordered wrt loads, and stores are ordered wrt other
> stores, but loads are *not* ordered wrt other stores in the absence of a
> serializing instruction, and it's exactly because of the write buffer.
>
> So:
>
> > But actually if we're careful, we can put them in separate parts of the
> > word and use the sub-word operations on x86 to avoid the atomic
> > requirement. I'm not aware of any architecture in which operations to
> > the same word could be out of order.
>
> See above. The above is unsafe, because if you do a regular store to a
> partial word, with no serializing instructions between that and a
> subsequent load of the whole word, the value of the store can be bypassed
> from the store buffer, and the load from the other part of the word can be
> carried out _before_ the store has actually gotten that cacheline
> exclusively!
>
> So when you do
>
> movb reg,(byteptr)
> movl (byteptr),reg
>
> you may actually get old data in the upper 24 bits, along with new data in
> the lower 8.
>
> I think.

I'd be very surprised if that was the case. But the unlock code needn't
do that anyway. It could do

movb reg,(byteptr) # clear PG_locked
movb (byteptr+1),reg # load PG_waiters
