Re: [PATCH 3/4] x86,asm: Re-work smp_store_mb()

From: Michael S. Tsirkin
Date: Wed Jan 13 2016 - 11:20:40 EST


On Tue, Jan 12, 2016 at 01:37:38PM -0800, Linus Torvalds wrote:
> On Tue, Jan 12, 2016 at 12:59 PM, Andy Lutomirski <luto@xxxxxxxxxxxxxx> wrote:
> >
> > Here's an article with numbers:
> >
> > http://shipilev.net/blog/2014/on-the-fence-with-dependencies/
>
> Well, that's with the busy loop and one set of code generation. It
> doesn't show the "oops, deeper stack isn't even in the cache any more
> due to call chains" issue.

It's an interesting read, thanks!

So the word at the stack pointer is read on return from a function, I think
(that's where "ret" picks up the return address). I added a function call and,
sure enough, it slows the "addl $0,0(%sp)" variant down. It's still faster
than mfence for me, though! Testing code + results are below. Reaching below
the stack pointer, or allocating an extra 4 bytes above it, gets the
performance back.
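
To make the dependency concrete, here is roughly what the epilogue of the
test function looks like with the 0(%rsp) barrier inlined (a hand-written
sketch of typical gcc -O2 output, not actual compiler output):

	lock addl $0,0(%rsp)	# RMW on the very slot that "ret" loads
	ret			# the return address from, so the return
				# has to wait for the locked op

versus the variant below the stack pointer:

	lock addl $0,-4(%rsp)	# overlaps no live data
	ret			# return-address load is independent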


> But yes:
>
> > I think they're suggesting using a negative offset, which is safe as
> > long as it doesn't page fault, even though we have the redzone
> > disabled.
>
> I think a negative offset might work very well. Partly exactly
> *because* we have the redzone disabled: we know that inside the
> kernel, we'll never have any live stack frame accesses under the stack
> pointer, so "-4(%rsp)" sounds good to me. There should never be any
> pending writes in the write buffer, because even if it *was* live, it
> would have been read off first.
>
> Yeah, it potentially does extend the stack cache footprint by another
> 4 bytes, but that sounds very benign.
>
> So perhaps it might be worth trying to switch the "mfence" to "lock ;
> addl $0,-4(%rsp)" in the kernel for x86-64, and remove the alternate
> for x86-32.
>
>
> I'd still want to see somebody try to benchmark it. I doubt it's
> noticeable, but making changes because you think it might save a few
> cycles without then even measuring it is just wrong.
>
> Linus

I'll try this in the kernel now and will report back, though I'm not
optimistic that a high-level benchmark will be able to show this kind
of effect.
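
For reference, the change being discussed would be something along these
lines in arch/x86/include/asm/barrier.h (a sketch of the idea only, not an
actual patch - macro names and the 32-bit ALTERNATIVE() handling vary
between kernel versions, and mb() itself presumably stays "mfence" for
things like non-temporal stores):

#ifdef CONFIG_X86_64
#define smp_mb() asm volatile("lock; addl $0,-4(%%rsp)" ::: "memory", "cc")
#else
#define smp_mb() asm volatile("lock; addl $0,-4(%%esp)" ::: "memory", "cc")
#endif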


---------------
main.c:
---------------

extern volatile int x;
volatile int x;

#ifdef __x86_64__
#define SP "rsp"
#else
#define SP "esp"
#endif


#ifdef lock
/* lock addl on a dummy local; "+m" since the RMW writes the operand */
#define barrier() do { int p = 0; asm volatile ("lock; addl $0,%0" : "+m"(p) :: "memory"); } while (0)
#endif
#ifdef locksp
/* lock addl on the word at the stack pointer: overlaps the return address */
#define barrier() asm("lock; addl $0,0(%%" SP ")" ::: "memory")
#endif
#ifdef lockrz
/* lock addl below the stack pointer; adding $0 leaves the value unchanged,
 * so clobbering the red zone is harmless here */
#define barrier() asm("lock; addl $0,-4(%%" SP ")" ::: "memory")
#endif
#ifdef xchg
/* xchg reads and writes both operands, hence "+r"/"+m"; implied lock */
#define barrier() do { int p = 0; int ret = 0; asm volatile ("xchgl %0, %1;": "+r"(ret), "+m"(p) :: "memory"); } while (0)
#endif
#ifdef xchgrz
/* same as xchg but poking at the gcc red zone (built with -mno-red-zone) */
#define barrier() do { int ret = 0; asm volatile ("xchgl %0, -4(%%" SP ");": "+r"(ret) :: "memory"); } while (0)
#endif
#ifdef mfence
#define barrier() asm("mfence" ::: "memory")
#endif
#ifdef lfence
/* lfence/sfence are not full barriers; included for comparison only */
#define barrier() asm("lfence" ::: "memory")
#endif
#ifdef sfence
#define barrier() asm("sfence" ::: "memory")
#endif

void __attribute__ ((noinline)) test(int i, int *j)
{
	/*
	 * Test barrier in a loop. We also poke at a volatile variable in an
	 * attempt to make it a bit more realistic - this way there's something
	 * in the store buffer.
	 */
	x = i - *j;
	barrier();
	*j = x;
}

int main(int argc, char **argv)
{
	int i;
	int j = 1234;

	for (i = 0; i < 10000000; ++i)
		test(i, &j);

	return 0;
}
---------------
Makefile:
---------------

ALL = xchg xchgrz lock locksp lockrz mfence lfence sfence

CC = gcc
CFLAGS += -Wall -O2 -ggdb
PERF = perf stat -r 10 --log-fd 1 --
TIME = /usr/bin/time -f %e
FILTER = cat

all: ${ALL}

clean:
	rm -f ${ALL}

run: all
	for file in ${ALL}; do echo ${RUN} ./$$file "|" ${FILTER}; ${RUN} ./$$file | ${FILTER}; done

perf time: run

time: RUN=${TIME}
perf: RUN=${PERF}
perf: FILTER=grep elapsed

.PHONY: all clean run perf time

xchgrz: CFLAGS += -mno-red-zone

${ALL}: main.c
	${CC} ${CFLAGS} -D$@ -o $@ main.c
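
BTW, to sanity-check the code generation of any one variant (e.g. that
test() in locksp really does end in a "ret" that loads the return address
off the stack), building and disassembling it works:

make locksp
objdump -d locksp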

--------------------------------------------
perf stat -r 10 --log-fd 1 -- ./xchg | grep elapsed
0.080420565 seconds time elapsed ( +- 2.31% )
perf stat -r 10 --log-fd 1 -- ./xchgrz | grep elapsed
0.087798571 seconds time elapsed ( +- 2.58% )
perf stat -r 10 --log-fd 1 -- ./lock | grep elapsed
0.083023724 seconds time elapsed ( +- 2.44% )
perf stat -r 10 --log-fd 1 -- ./locksp | grep elapsed
0.102880750 seconds time elapsed ( +- 0.13% )
perf stat -r 10 --log-fd 1 -- ./lockrz | grep elapsed
0.084917420 seconds time elapsed ( +- 3.28% )
perf stat -r 10 --log-fd 1 -- ./mfence | grep elapsed
0.156014715 seconds time elapsed ( +- 0.16% )
perf stat -r 10 --log-fd 1 -- ./lfence | grep elapsed
0.077731443 seconds time elapsed ( +- 0.12% )
perf stat -r 10 --log-fd 1 -- ./sfence | grep elapsed
0.036655741 seconds time elapsed ( +- 0.21% )