Re: [PATCH 3/4] x86,asm: Re-work smp_store_mb()

From: Michael S. Tsirkin
Date: Tue Jan 12 2016 - 12:45:35 EST


On Tue, Jan 12, 2016 at 09:20:06AM -0800, Linus Torvalds wrote:
> On Tue, Jan 12, 2016 at 5:57 AM, Michael S. Tsirkin <mst@xxxxxxxxxx> wrote:
> > #ifdef xchgrz
> > /* same as xchg but poking at gcc red zone */
> > #define barrier() do { int ret; asm volatile ("xchgl %0, -4(%%" SP ");": "=r"(ret) :: "memory", "cc"); } while (0)
> > #endif
>
> That's not safe in general. gcc might be using its redzone, so doing
> xchg into it is unsafe.
>
> But..
>
> > Is this a good way to test it?
>
> .. it's fine for some basic testing. It doesn't show any subtle
> interactions (ie some operations may have different dynamic behavior
> when the write buffers are busy etc), but as a baseline for "how fast
> can things go" the stupid raw loop is fine. And while the xchg into
> the redzone wouldn't be acceptable as a real implementation, for
> timing testing it's likely fine (ie you aren't hitting the problem it
> can cause).
>
> > So mfence is more expensive than locked instructions/xchg, but sfence/lfence
> > are slightly faster, and xchg and locked instructions are very close if
> > not the same.
>
> Note that we never actually *use* lfence/sfence. They are pointless
> instructions when looking at CPU memory ordering, because for pure CPU
> memory ordering stores and loads are already ordered.
>
> The only reason to use lfence/sfence is after you've used nontemporal
> stores for IO.


By the way, the comment in barrier.h says:

/*
* Some non-Intel clones support out of order store. wmb() ceases to be
* a nop for these.
*/

and while the first sentence may well be true, if you have an SMP
system with out-of-order stores, merely making wmb() a non-nop is not
going to be enough to help.

Additionally, as you point out, wmb() is not a nop even on regular
Intel CPUs, because of these rare nontemporal-store use cases.

Drop this comment?

> That's very very rare in the kernel. So I wouldn't
> worry about those.

Right - I'll leave these alone, whoever wants to optimize this path will
have to do the necessary research.

> But yes, it does sound like mfence is just a bad idea too.
>
> > There isn't any extra magic behind mfence, is there?
>
> No.
>
> I think the only issue is that there has never been any real reason
> for CPU designers to try to make mfence go particularly fast. Nobody
> uses it, again with the exception of some odd loops that use
> nontemporal stores, and for those the cost tends to always be about
> the nontemporal accesses themselves (often to things like GPU memory
> over PCIe), and the mfence cost of a few extra cycles is negligible.
>
> The reason "lock ; add $0" has generally been the fastest we've found
> is simply that locked ops have been important for CPU designers.
>
> So I think the patch is fine, and we should likely drop the use of mfence..
>
> Linus

OK so should I repost after a bit more testing? I don't believe this
will affect the kernel build benchmark, but I'll try :)


--
MST