Re: [PATCH v17 1/2] sys_membarrier(): system-wide memory barrier (generic, x86)

From: josh
Date: Mon May 04 2015 - 17:31:11 EST


On Mon, May 04, 2015 at 05:00:12PM -0400, Mathieu Desnoyers wrote:
> * Benchmarks
>
> On Intel Xeon E5405 (8 cores)
> (one thread is calling sys_membarrier, the other 7 threads are busy
> looping)
>
> 1000 non-expedited sys_membarrier calls in 33s = 33 milliseconds/call.
>
> * User-space user of this system call: Userspace RCU library
>
> Both the signal-based and the sys_membarrier userspace RCU schemes
> permit us to remove the memory barrier from the userspace RCU
> rcu_read_lock() and rcu_read_unlock() primitives, thus significantly
> accelerating them. These memory barriers are replaced by compiler
> barriers on the read-side, and all matching memory barriers on the
> write-side are turned into an invocation of a memory barrier on all
> active threads in the process. By letting the kernel perform this
> synchronization rather than dumbly sending a signal to every thread of
> the process (as we currently do), we reduce the number of unnecessary
> wake-ups and only issue the memory barriers on active threads.
> Non-running threads do not need to execute such barriers anyway,
> because these are implied by the scheduler context switches.
>
> Results in liburcu:
>
> Operations in 10s, 6 readers, 2 writers:
>
> memory barriers in reader: 1701557485 reads, 3129842 writes
> signal-based scheme: 9825306874 reads, 5386 writes
> sys_membarrier: 7992076602 reads, 220 writes
>
> The dynamic sys_membarrier availability check adds some overhead to
> the read-side compared to the signal-based scheme, but besides that,
> with the expedited scheme, we can see that we are close to the read-side
> performance of the signal-based scheme. However, this non-expedited
> sys_membarrier implementation has a much slower grace period than signal
> and memory barrier schemes.
>
> An expedited version of this system call can be added later on to speed
> up the grace period. Its implementation will likely depend on reading
> the cpu_curr()->mm without holding each CPU's rq lock.
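
For anyone reading without the earlier threads, the scheme quoted above
can be sketched roughly as follows. This is only an illustration, not
liburcu's actual code: has_membarrier, barrier_init(),
read_side_barrier() and write_side_barrier() are hypothetical names, and
the command values are the ones from mainline <linux/membarrier.h>,
which I am assuming match what this patch version exposes.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <unistd.h>
#include <sys/syscall.h>

/* Command values copied from mainline <linux/membarrier.h> so the sketch
 * builds without that header. Assumption: the v17 patch may differ. */
enum { CMD_QUERY = 0, CMD_SHARED = 1 };

static int has_membarrier; /* hypothetical flag, set once by barrier_init() */

/* Thin wrapper; reports failure where the syscall number is unknown. */
static int membarrier_call(int cmd, int flags)
{
#ifdef __NR_membarrier
    return (int)syscall(__NR_membarrier, cmd, flags);
#else
    (void)cmd;
    (void)flags;
    return -1;
#endif
}

/* Dynamic availability check: QUERY returns a mask of supported commands. */
static void barrier_init(void)
{
    int mask = membarrier_call(CMD_QUERY, 0);
    has_membarrier = (mask >= 0) && (mask & CMD_SHARED);
}

/* Read side: only a compiler barrier when the kernel does the heavy part. */
static inline void read_side_barrier(void)
{
    if (has_membarrier)
        __asm__ __volatile__("" ::: "memory");
    else
        __sync_synchronize(); /* full memory barrier fallback */
}

/* Write side: ask the kernel for a barrier on all running threads. */
static inline void write_side_barrier(void)
{
    if (has_membarrier) {
        int rc = membarrier_call(CMD_SHARED, 0);
        assert(rc == 0); /* a real library would handle failure instead */
        (void)rc;
    } else {
        __sync_synchronize();
    }
}
```

The branch on has_membarrier in read_side_barrier() is the dynamic
availability check that the quoted text blames for the extra read-side
overhead.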

So, I realize that there's a lot of history tied up in the previous 16
versions and associated mail threads. However, can you please summarize
in the commit message what the benefit of merging this version is?
Because from the text above, from liburcu's perspective, it appears to
be strictly worse in performance than the signal-based scheme.
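
To put numbers on that, here is a small sketch that compares the two
rows of the quoted results. The figures are copied from the table
above; ratio(), read_ratio() and write_slowdown() are hypothetical
helper names. By my arithmetic, sys_membarrier manages roughly four
fifths of the signal-based reads and about one twenty-fourth of the
writes.

```c
#include <assert.h>
#include <stdio.h>

/* Figures copied from the quoted liburcu results (operations in 10s). */
static const double SIGNAL_READS = 9825306874.0;
static const double SIGNAL_WRITES = 5386.0;
static const double MEMBARRIER_READS = 7992076602.0;
static const double MEMBARRIER_WRITES = 220.0;

/* Plain division with a guard against a zero denominator. */
static double ratio(double num, double den)
{
    assert(den != 0.0);
    return num / den;
}

/* Fraction of signal-based read throughput that sys_membarrier retains. */
static double read_ratio(void)
{
    return ratio(MEMBARRIER_READS, SIGNAL_READS);
}

/* How many times more writes the signal-based scheme completes. */
static double write_slowdown(void)
{
    return ratio(SIGNAL_WRITES, MEMBARRIER_WRITES);
}

/* Print both comparisons; call this from main() to see the values. */
static void print_comparison(void)
{
    printf("reads:  %.2f of signal-based\n", read_ratio());
    printf("writes: %.1fx fewer than signal-based\n", write_slowdown());
}
```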

There are other non-performance reasons why it might make sense to
include this; for instance, signals don't play nice with libraries, with
other processes you might inject yourself into for tracing purposes, or
with general sanity. However, the explanation for those use cases and
how membarrier() improves them needs to go in the commit message, rather
than only in the collective memory and mail archives of people who have
discussed this patch series.

(My apologies if the explanation is in the commit message and I've just
missed it.)

- Josh Triplett