[RFC PATCH v13] sys_membarrier(): system/process-wide memory barrier (x86)

From: Mathieu Desnoyers
Date: Tue Mar 17 2015 - 13:22:42 EST


Here is an implementation of a new system call, sys_membarrier(), which
executes a memory barrier on either all running threads of the current
process (MEMBARRIER_PRIVATE) issues a memory barrier on all threads
running on the system (~MEMBARRIER_PRIVATE). Both are currently
implemented by calling synchronize_sched(). It can be used to
distribute the cost of user-space memory barriers asymmetrically by
transforming pairs of memory barriers into pairs consisting of
sys_membarrier() and a compiler barrier. For synchronization primitives
that distinguish between read-side and write-side (e.g. userspace RCU
[1], rwlocks), the read-side can be accelerated significantly by moving
the bulk of the memory barrier overhead to the write-side.

It is based on kernel v3.19, and also applies fine to master. I am
submitting it as RFC. I'm asking for acked-by and reviewed-by tags
again because I did major changes/simplifications since v12.

To explain the benefit of this scheme, let's introduce two example threads:

Thread A (non-frequent, e.g. executing liburcu synchronize_rcu())
Thread B (frequent, e.g. executing liburcu
rcu_read_lock()/rcu_read_unlock())

In a scheme where all smp_mb() in thread A are ordering memory accesses
with respect to smp_mb() present in Thread B, we can change each
smp_mb() within Thread A into calls to sys_membarrier() and each
smp_mb() within Thread B into compiler barriers "barrier()".

Before the change, we had, for each smp_mb() pairs:

Thread A Thread B
previous mem accesses previous mem accesses
smp_mb() smp_mb()
following mem accesses following mem accesses

After the change, these pairs become:

Thread A Thread B
prev mem accesses prev mem accesses
sys_membarrier() barrier()
follow mem accesses follow mem accesses

As we can see, there are two possible scenarios: either Thread B memory
accesses do not happen concurrently with Thread A accesses (1), or they
do (2).

1) Non-concurrent Thread A vs Thread B accesses:

Thread A Thread B
prev mem accesses
sys_membarrier()
follow mem accesses
prev mem accesses
barrier()
follow mem accesses

In this case, thread B accesses will be weakly ordered. This is OK,
because at that point, thread A is not particularly interested in
ordering them with respect to its own accesses.

2) Concurrent Thread A vs Thread B accesses

Thread A Thread B
prev mem accesses prev mem accesses
sys_membarrier() barrier()
follow mem accesses follow mem accesses

In this case, thread B accesses, which are ensured to be in program
order thanks to the compiler barrier, will be "upgraded" to full
smp_mb() by synchronize_sched().

* Benchmarks

On Intel Xeon E5405 (8 cores)
(one thread is calling sys_membarrier, the other 7 threads are busy
looping)

1000 non-expedited sys_membarrier calls in 33s = 33 milliseconds/call.

* User-space user of this system call: Userspace RCU library

Both the signal-based and the sys_membarrier userspace RCU schemes
permit us to remove the memory barrier from the userspace RCU
rcu_read_lock() and rcu_read_unlock() primitives, thus significantly
accelerating them. These memory barriers are replaced by compiler
barriers on the read-side, and all matching memory barriers on the
write-side are turned into an invocation of a memory barrier on all
active threads in the process. By letting the kernel perform this
synchronization rather than dumbly sending a signal to every process
threads (as we currently do), we diminish the number of unnecessary wake
ups and only issue the memory barriers on active threads. Non-running
threads do not need to execute such barrier anyway, because these are
implied by the scheduler context switches.

Results in liburcu:

Operations in 10s, 6 readers, 2 writers:

memory barriers in reader: 1701557485 reads, 3129842 writes
signal-based scheme: 9825306874 reads, 5386 writes
sys_membarrier: 7992076602 reads, 220 writes

The dynamic sys_membarrier availability check adds some overhead to
the read-side compared to the signal-based scheme, but besides that,
with the expedited scheme, we can see that we are close to the read-side
performance of the signal-based scheme. However, this non-expedited
sys_membarrier implementation has a much slower grace period than signal
and memory barrier schemes.

An expedited version of this system call can be added later on to speed
up the grace period. Its implementation will likely depend on reading
the cpu_curr()->mm without holding each CPU's rq lock.

This patch only adds the system call to x86.

[1] http://urcu.so

Alternative approach: signals. A signal-based alternative proposed by
Ingo would lead to important scheduler modifications, which involves
adding context-switch in/out overhead (calling user-space upon scheduler
switch in and out). In addition to the overhead issue, I am also
reluctant to base a synchronization primitive on the signals, which, to
quote Linus, are "already one of our more "exciting" layers out there",
which does not give me the warm feeling of rock-solidness that's usually
expected from synchronization primitives.

Changes since v12:
- Remove _FLAG suffix from uapi flags.
- Add Expert menuconfig option CONFIG_MEMBARRIER (default=y).
- Remove EXPEDITED mode. Only implement non-expedited for now, until
reading the cpu_curr()->mm can be done without holding the CPU's rq
lock.

Changes since v11:
- 5 years have passed.
- Rebase on v3.19 kernel.
- Add futex-alike PRIVATE vs SHARED semantic: private for per-process
barriers, non-private for memory mappings shared between processes.
- Simplify user API.
- Code refactoring.

Changes since v10:
- Apply Randy's comments.
- Rebase on 2.6.34-rc4 -tip.

Changes since v9:
- Clean up #ifdef CONFIG_SMP.

Changes since v8:
- Go back to rq spin locks taken by sys_membarrier() rather than adding
memory barriers to the scheduler. It implies a potential RoS
(reduction of service) if sys_membarrier() is executed in a busy-loop
by a user, but nothing more than what is already possible with other
existing system calls, but saves memory barriers in the scheduler fast
path.
- re-add the memory barrier comments to x86 switch_mm() as an example to
other architectures.
- Update documentation of the memory barriers in sys_membarrier and
switch_mm().
- Append execution scenarios to the changelog showing the purpose of
each memory barrier.

Changes since v7:
- Move spinlock-mb and scheduler related changes to separate patches.
- Add support for sys_membarrier on x86_32.
- Only x86 32/64 system calls are reserved in this patch. It is planned
to incrementally reserve syscall IDs on other architectures as these
are tested.

Changes since v6:
- Remove some unlikely() not so unlikely.
- Add the proper scheduler memory barriers needed to only use the RCU
read lock in sys_membarrier rather than take each runqueue spinlock:
- Move memory barriers from per-architecture switch_mm() to schedule()
and finish_lock_switch(), where they clearly document that all data
protected by the rq lock is guaranteed to have memory barriers issued
between the scheduler update and the task execution. Replacing the
spin lock acquire/release barriers with these memory barriers imply
either no overhead (x86 spinlock atomic instruction already implies a
full mb) or some hopefully small overhead caused by the upgrade of the
spinlock acquire/release barriers to more heavyweight smp_mb().
- The "generic" version of spinlock-mb.h declares both a mapping to
standard spinlocks and full memory barriers. Each architecture can
specialize this header following their own need and declare
CONFIG_HAVE_SPINLOCK_MB to use their own spinlock-mb.h.
- Note: benchmarks of scheduler overhead with specialized spinlock-mb.h
implementations on a wide range of architecture would be welcome.

Changes since v5:
- Plan ahead for extensibility by introducing mandatory/optional masks
to the "flags" system call parameter. Past experience with accept4(),
signalfd4(), eventfd2(), epoll_create1(), dup3(), pipe2(), and
inotify_init1() indicates that this is the kind of thing we want to
plan for. Return -EINVAL if the mandatory flags received are unknown.
- Create include/linux/membarrier.h to define these flags.
- Add MEMBARRIER_QUERY optional flag.

Changes since v4:
- Add "int expedited" parameter, use synchronize_sched() in the
non-expedited case. Thanks to Lai Jiangshan for making us consider
seriously using synchronize_sched() to provide the low-overhead
membarrier scheme.
- Check num_online_cpus() == 1, quickly return without doing nothing.

Changes since v3a:
- Confirm that each CPU indeed runs the current task's ->mm before
sending an IPI. Ensures that we do not disturb RT tasks in the
presence of lazy TLB shootdown.
- Document memory barriers needed in switch_mm().
- Surround helper functions with #ifdef CONFIG_SMP.

Changes since v2:
- simply send-to-many to the mm_cpumask. It contains the list of
processors we have to IPI to (which use the mm), and this mask is
updated atomically.

Changes since v1:
- Only perform the IPI in CONFIG_SMP.
- Only perform the IPI if the process has more than one thread.
- Only send IPIs to CPUs involved with threads belonging to our process.
- Adaptative IPI scheme (single vs many IPI with threshold).
- Issue smp_mb() at the beginning and end of the system call.

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@xxxxxxxxxxxx>
CC: Paul E. McKenney <paulmck@xxxxxxxxxxxxxxxxxx>
CC: Josh Triplett <josh@xxxxxxxxxxxxxxxx>
CC: KOSAKI Motohiro <kosaki.motohiro@xxxxxxxxxxxxxx>
CC: Steven Rostedt <rostedt@xxxxxxxxxxx>
CC: Nicholas Miell <nmiell@xxxxxxxxxxx>
CC: Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx>
CC: Ingo Molnar <mingo@xxxxxxxxxx>
CC: Alan Cox <gnomes@xxxxxxxxxxxxxxxxxxx>
CC: Lai Jiangshan <laijs@xxxxxxxxxxxxxx>
CC: Stephen Hemminger <stephen@xxxxxxxxxxxxxxxxxx>
CC: Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx>
CC: Thomas Gleixner <tglx@xxxxxxxxxxxxx>
CC: Peter Zijlstra <peterz@xxxxxxxxxxxxx>
CC: David Howells <dhowells@xxxxxxxxxx>
---
arch/x86/syscalls/syscall_32.tbl | 1 +
arch/x86/syscalls/syscall_64.tbl | 1 +
include/linux/syscalls.h | 2 +
include/uapi/asm-generic/unistd.h | 4 ++-
include/uapi/linux/Kbuild | 1 +
include/uapi/linux/membarrier.h | 68 ++++++++++++++++++++++++++++++++++
init/Kconfig | 9 +++++
kernel/rcu/update.c | 72 +++++++++++++++++++++++++++++++++++++
kernel/sys_ni.c | 3 ++
9 files changed, 160 insertions(+), 1 deletions(-)
create mode 100644 include/uapi/linux/membarrier.h

diff --git a/arch/x86/syscalls/syscall_32.tbl b/arch/x86/syscalls/syscall_32.tbl
index b3560ec..439415f 100644
--- a/arch/x86/syscalls/syscall_32.tbl
+++ b/arch/x86/syscalls/syscall_32.tbl
@@ -365,3 +365,4 @@
356 i386 memfd_create sys_memfd_create
357 i386 bpf sys_bpf
358 i386 execveat sys_execveat stub32_execveat
+359 i386 membarrier sys_membarrier
diff --git a/arch/x86/syscalls/syscall_64.tbl b/arch/x86/syscalls/syscall_64.tbl
index 8d656fb..823130d 100644
--- a/arch/x86/syscalls/syscall_64.tbl
+++ b/arch/x86/syscalls/syscall_64.tbl
@@ -329,6 +329,7 @@
320 common kexec_file_load sys_kexec_file_load
321 common bpf sys_bpf
322 64 execveat stub_execveat
+323 common membarrier sys_membarrier

#
# x32-specific system call numbers start at 512 to avoid cache impact
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index 85893d7..058ec0a 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -882,4 +882,6 @@ asmlinkage long sys_execveat(int dfd, const char __user *filename,
const char __user *const __user *argv,
const char __user *const __user *envp, int flags);

+asmlinkage long sys_membarrier(int flags);
+
#endif
diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
index e016bd9..8da542a 100644
--- a/include/uapi/asm-generic/unistd.h
+++ b/include/uapi/asm-generic/unistd.h
@@ -709,9 +709,11 @@ __SYSCALL(__NR_memfd_create, sys_memfd_create)
__SYSCALL(__NR_bpf, sys_bpf)
#define __NR_execveat 281
__SC_COMP(__NR_execveat, sys_execveat, compat_sys_execveat)
+#define __NR_membarrier 282
+__SYSCALL(__NR_membarrier, sys_membarrier)

#undef __NR_syscalls
-#define __NR_syscalls 282
+#define __NR_syscalls 283

/*
* All syscalls below here should go away really,
diff --git a/include/uapi/linux/Kbuild b/include/uapi/linux/Kbuild
index 00b10002..c5b0dbf 100644
--- a/include/uapi/linux/Kbuild
+++ b/include/uapi/linux/Kbuild
@@ -248,6 +248,7 @@ header-y += mdio.h
header-y += media.h
header-y += media-bus-format.h
header-y += mei.h
+header-y += membarrier.h
header-y += memfd.h
header-y += mempolicy.h
header-y += meye.h
diff --git a/include/uapi/linux/membarrier.h b/include/uapi/linux/membarrier.h
new file mode 100644
index 0000000..eb1a820
--- /dev/null
+++ b/include/uapi/linux/membarrier.h
@@ -0,0 +1,68 @@
+#ifndef _UAPI_LINUX_MEMBARRIER_H
+#define _UAPI_LINUX_MEMBARRIER_H
+
+/*
+ * linux/membarrier.h
+ *
+ * membarrier system call API
+ *
+ * Copyright (c) 2010, 2015 Mathieu Desnoyers <mathieu.desnoyers@xxxxxxxxxxxx>
+ *
+ * Permission is hereby granted, free of charge, to any person obtaining a copy
+ * of this software and associated documentation files (the "Software"), to deal
+ * in the Software without restriction, including without limitation the rights
+ * to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+ * copies of the Software, and to permit persons to whom the Software is
+ * furnished to do so, subject to the following conditions:
+ *
+ * The above copyright notice and this permission notice shall be included in
+ * all copies or substantial portions of the Software.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+ * AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+ * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ */
+
+/*
+ * All memory accesses performed in program order from each targeted thread is
+ * guaranteed to be ordered with respect to sys_membarrier(). If we use the
+ * semantic "barrier()" to represent a compiler barrier forcing memory accesses
+ * to be performed in program order across the barrier, and smp_mb() to
+ * represent explicit memory barriers forcing full memory ordering across the
+ * barrier, we have the following ordering table for each pair of barrier(),
+ * sys_membarrier() and smp_mb() :
+ *
+ * The pair ordering is detailed as (O: ordered, X: not ordered):
+ *
+ * barrier() smp_mb() sys_membarrier()
+ * barrier() X X O
+ * smp_mb() X O O
+ * sys_membarrier() O O O
+ *
+ * If the private flag is set, only running threads belonging to the same
+ * process are targeted. Else, all running threads, including those belonging to
+ * other processes are targeted.
+ */
+
+/* System call membarrier "flags" argument. */
+enum {
+ /*
+ * Private flag set: only synchronize across a single process. If this
+ * flag is not set, it means "shared": synchronize across multiple
+ * processes. The shared mode is useful for shared memory mappings
+ * across processes.
+ */
+ MEMBARRIER_PRIVATE = (1 << 0),
+
+ /*
+ * Query whether the rest of the specified flags are supported, without
+ * performing synchronization.
+ */
+ MEMBARRIER_QUERY = (1 << 31),
+};
+
+#endif /* _UAPI_LINUX_MEMBARRIER_H */
diff --git a/init/Kconfig b/init/Kconfig
index 9afb971..c5f58ce 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1568,6 +1568,15 @@ config PCI_QUIRKS
bugs/quirks. Disable this only if your target machine is
unaffected by PCI quirks.

+config MEMBARRIER
+ bool "Enable membarrier() system call" if EXPERT
+ default y
+ help
+ Enable the membarrier() system call that allows issuing
+ memory barriers across cores.
+
+ If unsure, say Y.
+
config EMBEDDED
bool "Embedded system"
option allnoconfig_y
diff --git a/kernel/rcu/update.c b/kernel/rcu/update.c
index e0d31a3..9aec3d6 100644
--- a/kernel/rcu/update.c
+++ b/kernel/rcu/update.c
@@ -49,6 +49,8 @@
#include <linux/module.h>
#include <linux/kthread.h>
#include <linux/tick.h>
+#include <linux/syscalls.h>
+#include <linux/membarrier.h>

#define CREATE_TRACE_POINTS

@@ -775,3 +777,73 @@ late_initcall(rcu_verify_early_boot_tests);
#else
void rcu_early_boot_tests(void) {}
#endif /* CONFIG_PROVE_RCU */
+
+#ifdef CONFIG_MEMBARRIER
+
+static int membarrier_validate_flags(int flags)
+{
+ /* Check for unrecognized flag. */
+ if (flags & ~(MEMBARRIER_PRIVATE | MEMBARRIER_QUERY))
+ return -EINVAL;
+ return 0;
+}
+
+#ifdef CONFIG_SMP
+
+/*
+ * sys_membarrier - issue memory barrier on target threads
+ * @flags: MEMBARRIER_PRIVATE:
+ * Private flag set: only synchronize across a single process. If
+ * this flag is not set, it means "shared": synchronize across
+ * multiple processes. The shared mode is useful for shared memory
+ * mappings across processes.
+ * MEMBARRIER_QUERY:
+ * Query whether the rest of the specified flags are supported,
+ * without performing synchronization.
+ *
+ * return values: Returns -EINVAL if the flags are incorrect. Testing for kernel
+ * sys_membarrier support can be done by checking for -ENOSYS return value.
+ * Return value of 0 indicates success. For a given set of flags on a given
+ * kernel, this system call will always return the same value. It is therefore
+ * correct to check the return value only once during a process lifetime,
+ * setting MEMBARRIER_QUERY to only check if the flags are supported,
+ * without performing any synchronization.
+ *
+ * This system call executes a memory barrier on all targeted threads.
+ * If the private flag is set, only running threads belonging to the same
+ * process are targeted. Else, all running threads, including those belonging to
+ * other processes are targeted. Upon completion, the caller thread is ensured
+ * that all targeted running threads have passed through a state where all
+ * memory accesses to user-space addresses match program order. (non-running
+ * threads are de facto in such a state.)
+ *
+ * On uniprocessor systems, this system call simply returns 0 after validating
+ * the arguments, so user-space knows it is implemented.
+ */
+SYSCALL_DEFINE1(membarrier, int, flags)
+{
+ int retval;
+
+ retval = membarrier_validate_flags(flags);
+ if (retval)
+ goto end;
+ if (unlikely((flags & MEMBARRIER_QUERY)
+ || ((flags & MEMBARRIER_PRIVATE)
+ && thread_group_empty(current)))
+ || num_online_cpus() == 1)
+ goto end;
+ synchronize_sched();
+end:
+ return retval;
+}
+
+#else /* !CONFIG_SMP */
+
+SYSCALL_DEFINE1(membarrier, int, flags)
+{
+ return membarrier_validate_flags(flags);
+}
+
+#endif /* CONFIG_SMP */
+
+#endif /* CONFIG_MEMBARRIER */
diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
index 5adcb0a..5913b84 100644
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -229,3 +229,6 @@ cond_syscall(sys_bpf);

/* execveat */
cond_syscall(sys_execveat);
+
+/* membarrier */
+cond_syscall(sys_membarrier);
--
1.7.7.3

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/