Re: [PATCH 3/4] x86,asm: Re-work smp_store_mb()

From: Linus Torvalds
Date: Tue Jan 12 2016 - 12:20:16 EST


On Tue, Jan 12, 2016 at 5:57 AM, Michael S. Tsirkin <mst@xxxxxxxxxx> wrote:
> #ifdef xchgrz
> /* same as xchg but poking at gcc red zone */
> #define barrier() do { int ret; asm volatile ("xchgl %0, -4(%%" SP ");": "=r"(ret) :: "memory", "cc"); } while (0)
> #endif

That's not safe in general. gcc might be keeping live data in its
redzone, so an xchg into it can clobber that data.

But..

> Is this a good way to test it?

.. it's fine for some basic testing. It doesn't show any subtle
interactions (ie some operations may have different dynamic behavior
when the write buffers are busy etc), but as a baseline for "how fast
can things go" the stupid raw loop is fine. And while the xchg into
the redzone wouldn't be acceptable as a real implementation, for
timing testing it's likely fine (ie you aren't hitting the problem it
can cause).
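
For illustration, a raw loop of that kind could look roughly like the
sketch below. It is not the test program from this thread: the helper
names (barrier_xchg, barrier_mfence, time_loop) and the xchg_dummy
variable are made up here, the xchg goes to an ordinary variable rather
than the redzone, and clock() only gives coarse CPU-time ticks.

```c
#include <time.h>

/* Hypothetical names for illustration; none of these come from the patch. */
static int xchg_dummy;

/*
 * xchg with a memory operand is implicitly locked on x86, so it acts as a
 * full barrier. It targets an ordinary variable here, not the red zone.
 */
static void barrier_xchg(void)
{
#if defined(__x86_64__) || defined(__i386__)
	int tmp = 0;

	__asm__ __volatile__("xchgl %0, %1"
			     : "+r"(tmp), "+m"(xchg_dummy)
			     :
			     : "memory");
#else
	__sync_synchronize();	/* non-x86 fallback so the sketch still builds */
#endif
}

static void barrier_mfence(void)
{
#if defined(__x86_64__) || defined(__i386__)
	__asm__ __volatile__("mfence" ::: "memory");
#else
	__sync_synchronize();	/* non-x86 fallback */
#endif
}

/* Call fn() back to back and return the CPU time used, in clock() ticks. */
static clock_t time_loop(void (*fn)(void), unsigned long iters)
{
	clock_t start = clock();

	for (unsigned long i = 0; i < iters; i++)
		fn();
	return clock() - start;
}
```

Calling time_loop(barrier_xchg, n) and time_loop(barrier_mfence, n) with
the same large n gives the kind of baseline comparison described above,
and nothing more than that.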

> So mfence is more expensive than locked instructions/xchg, but sfence/lfence
> are slightly faster, and xchg and locked instructions are very close if
> not the same.

Note that we never actually *use* lfence/sfence. They are pointless
instructions when looking at CPU memory ordering, because for pure CPU
memory ordering loads are already ordered against other loads, and
stores against other stores.

The only reason to use lfence/sfence is after you've used nontemporal
stores for IO. That's very very rare in the kernel. So I wouldn't
worry about those.
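
As a rough sketch of that rare case: a run of nontemporal stores
followed by an sfence. The function name fill_nontemporal is made up for
this example, it uses the compiler's SSE2 intrinsics rather than
anything kernel-specific, and it writes to plain memory where real code
would typically be writing to an IO mapping.

```c
#include <stddef.h>
#if defined(__SSE2__)
#include <emmintrin.h>	/* _mm_stream_si32(); also pulls in _mm_sfence() */
#endif

/* Hypothetical helper: fill dst[0..n-1] with val using nontemporal stores. */
static void fill_nontemporal(int *dst, int val, size_t n)
{
#if defined(__SSE2__)
	for (size_t i = 0; i < n; i++)
		_mm_stream_si32(&dst[i], val);	/* movnti: weakly ordered store */
	_mm_sfence();	/* order the streaming stores before any later stores */
#else
	/* Fallback so the sketch builds without SSE2: ordinary stores. */
	for (size_t i = 0; i < n; i++)
		dst[i] = val;
	__sync_synchronize();
#endif
}
```

Ordinary stores need no such fence on x86, which is why this pattern
barely shows up.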

But yes, it does sound like mfence is just a bad idea too.

> There isn't any extra magic behind mfence, is there?

No.

I think the only issue is that there has never been any real reason
for CPU designers to try to make mfence go particularly fast. Nobody
uses it, again with the exception of some odd loops that use
nontemporal stores, and for those the cost tends to always be about
the nontemporal accesses themselves (often to things like GPU memory
over PCIe), and the mfence cost of a few extra cycles is negligible.

The reason "lock ; add $0" has generally been the fastest we've found
is simply that locked ops have been important for CPU designers.
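
A minimal sketch of such a barrier, under assumptions: the macro name
mb_lock_add and the helper store_then_load are invented for this
example, and the -4 offset from the stack pointer is just one possible
choice of target. Adding zero leaves the target word unchanged, so
unlike an xchg it cannot corrupt whatever gcc keeps there.

```c
/* Hypothetical macro name; the kernel's own definition may differ. */
#if defined(__x86_64__)
#define mb_lock_add() \
	__asm__ __volatile__("lock; addl $0,-4(%%rsp)" ::: "memory", "cc")
#elif defined(__i386__)
#define mb_lock_add() \
	__asm__ __volatile__("lock; addl $0,-4(%%esp)" ::: "memory", "cc")
#else
#define mb_lock_add() __sync_synchronize()	/* non-x86 fallback */
#endif

/*
 * Store-then-load is the one reordering x86 allows for ordinary accesses,
 * so this is where a full barrier is needed.
 */
static int store_then_load(volatile int *mine, volatile int *other)
{
	*mine = 1;
	mb_lock_add();	/* keep the load below from passing the store above */
	return *other;
}
```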

So I think the patch is fine, and we should likely drop the use of mfence..

Linus