Re: [PATCH 0/3] 64-bit futexes: Intro

From: Nick Piggin
Date: Thu Jun 05 2008 - 00:29:23 EST


On Wed, Jun 04, 2008 at 08:08:37PM -0700, Linus Torvalds wrote:
>
>
> On Thu, 5 Jun 2008, Nick Piggin wrote:
> >
> > I'd have thought that for a case like this, you'd simply hit the store
> > alias logic and store forwarding would stall because it doesn't have
> > the full data.
>
> That's _one_ possible implementation.
>
> Quite frankly, I think it's the less likely one. It's much more likely
> that the cache read access and the store buffer probe happen in parallel
> (this is a really important hotpath for any CPU, but even more so x86
> where there are more loads and stores that are spills). And then the
> store buffer logic would return the data and a bytemask (where the
> mask would be all zeroes for a miss), and the returned value is just the
> appropriate mix of the two.
>
> > I'd like to know for sure.
>
> You'd have to ask somebody very knowledgeable inside Intel and AMD, and it
> is quite likely that different microarchitectures have different
> approaches...

Well, it would just be nice to hear a "no we'll never do that", "we
already do", or "you can't rely on it" ;)
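
For reference, the parallel probe-and-mix scheme described above can be
modelled in a few lines of C. This is only a toy sketch of the idea, not a
description of any real CPU: struct sb_entry, expand_bytemask() and
merge_load() are made-up names, and the model assumes a single store buffer
entry covering one aligned 8-byte window.

```c
#include <stdint.h>

/*
 * Hypothetical model of one store-buffer entry covering an aligned
 * 8-byte window. Bit i of bytemask is set if byte i was written by a
 * pending store. All names here are illustrative assumptions.
 */
struct sb_entry {
	uint64_t data;		/* bytes written by pending stores */
	uint8_t  bytemask;	/* which of the 8 bytes are valid */
};

/* Expand a per-byte mask into a per-bit mask (bit i -> byte i all ones). */
static uint64_t expand_bytemask(uint8_t bytemask)
{
	uint64_t bits = 0;
	int i;

	for (i = 0; i < 8; i++)
		if (bytemask & (1u << i))
			bits |= (uint64_t)0xff << (8 * i);
	return bits;
}

/*
 * Mix the cache read with the store-buffer probe result. A bytemask of
 * zero is a store-buffer miss, so the cache data is returned unchanged.
 */
static uint64_t merge_load(uint64_t cache_data, struct sb_entry sb)
{
	uint64_t bits = expand_bytemask(sb.bytemask);

	return (sb.data & bits) | (cache_data & ~bits);
}
```

In this model a 32-bit store followed by a 64-bit load of the same word
would get a bytemask of 0x0f: the load returns the new low half from the
store buffer and whatever the cache holds for the high half, with no stall.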


> > The other thing that could be possible, and I'd imagine maybe more likely
> > to be implemented in a real CPU because it should give more improvement
> > (and which does break my algorithm) is just that the load to the cacheline
> > may get to execute first, then if the cacheline gets evicted and
> > modified by another CPU before our store completes, we effectively see
> > store/load reordering again.
>
> Oh, absolutely, the perfect algorithm would actually get the right answer
> and notice that the cacheline got evicted, and retried the whole sequence
> such that it is coherent.
>
> But we do know that Intel expressly documents loads and stores to pass
> each other and documents the fact that the store buffer is there. So I bet
> that this is visible in some micro-architecture, even if it's not
> necessarily visible in _all_ of them.
>
> The recent Intel memory ordering whitepaper makes it very clear that loads
> can pass earlier stores and in particular that the store buffer allows
> intra-processor forwarding to subsequent loads (2.4 in their whitepaper).
> It _could_ be just a "for future CPU's", but quite frankly, I'm 100% sure
> it isn't. The store->load forwarding is such a critical performance issue
> that I can pretty much guarantee that it doesn't always hit the cacheline.

Well, I have a simple test case showing that loads pass earlier
non-conflicting stores in the case where the loads do not come from the
store buffer (i.e. *inter*-processor forwarding).

And store forwarding, by definition, means that the load can complete
before the store can possibly be visible to another CPU, I'd say. So yes,
I'm sure this does happen too.
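
A minimal sketch of the classic store-buffer litmus test for this kind of
reordering follows. It is not necessarily the test case mentioned above;
run_litmus(), worker() and the variable names are illustrative, and it
assumes C11 atomics and pthreads. Each thread stores 1 to its own location
and then loads the other thread's location. If both loads return 0 in the
same round, at least one load passed the earlier store.

```c
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>

static atomic_int x, y;		/* the two shared locations */
static atomic_int go;		/* round number; main bumps it to start a round */
static atomic_int done;		/* workers finished in the current round */
static atomic_int stop;		/* set only if thread creation fails */
static int r[2];		/* value each worker loaded this round */
static int rounds;

static void *worker(void *arg)
{
	int id = (int)(long)arg;
	atomic_int *mine = id ? &y : &x;
	atomic_int *other = id ? &x : &y;
	int i;

	for (i = 1; i <= rounds; i++) {
		/* Wait until main starts round i. */
		while (atomic_load_explicit(&go, memory_order_acquire) != i) {
			if (atomic_load(&stop))
				return NULL;
			sched_yield();
		}
		/* Store to our location, then load the other one. */
		atomic_store_explicit(mine, 1, memory_order_relaxed);
		r[id] = atomic_load_explicit(other, memory_order_relaxed);
		atomic_fetch_add_explicit(&done, 1, memory_order_release);
	}
	return NULL;
}

/*
 * Run n rounds. Returns how many rounds ended with both loads seeing 0,
 * or -1 if the threads could not be created.
 */
static int run_litmus(int n)
{
	pthread_t t[2];
	int i, both_zero = 0;

	rounds = n;
	atomic_store(&go, 0);
	atomic_store(&stop, 0);
	if (pthread_create(&t[0], NULL, worker, (void *)0L))
		return -1;
	if (pthread_create(&t[1], NULL, worker, (void *)1L)) {
		atomic_store(&stop, 1);
		pthread_join(t[0], NULL);
		return -1;
	}
	for (i = 1; i <= n; i++) {
		atomic_store(&x, 0);
		atomic_store(&y, 0);
		atomic_store(&done, 0);
		atomic_store_explicit(&go, i, memory_order_release);
		while (atomic_load_explicit(&done, memory_order_acquire) != 2)
			sched_yield();
		if (r[0] == 0 && r[1] == 0)
			both_zero++;
	}
	pthread_join(t[0], NULL);
	pthread_join(t[1], NULL);
	return both_zero;
}
```

A non-zero result from something like run_litmus(1000000) means the
reordering was observed. A zero result proves nothing, since how often the
window is hit depends on timing and on the machine. Note that with relaxed
atomics the compiler is also allowed to reorder the store and the load, so
a careful test would inspect the generated code or use asm directly.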
