Re: [PATCH for v4.2 v18 1/3] sys_membarrier(): system-wide memory barrier (generic, x86)

From: Mathieu Desnoyers
Date: Sun May 31 2015 - 08:53:41 EST


----- On May 30, 2015, at 12:40 AM, Andrew Morton akpm@xxxxxxxxxxxxxxxxxxxx wrote:

> On Sat, 16 May 2015 19:48:18 -0400 Mathieu Desnoyers
> <mathieu.desnoyers@xxxxxxxxxxxx> wrote:
>
>> Here is an implementation of a new system call, sys_membarrier(), which
>> executes a memory barrier on all threads running on the system. It is
>> implemented by calling synchronize_sched(). It can be used to distribute
>> the cost of user-space memory barriers asymmetrically by transforming
>> pairs of memory barriers into pairs consisting of sys_membarrier() and a
>> compiler barrier. For synchronization primitives that distinguish
>> between read-side and write-side (e.g. userspace RCU [1], rwlocks), the
>> read-side can be accelerated significantly by moving the bulk of the
>> memory barrier overhead to the write-side.
>>
>> ...
>>
>
> It would be nice to hear about the real world value of this syscall to
> our users. I'm seeing test results for a microbenchmark but so what.
> What actual applications or application classes are calling for this and
> what results can they expect to see?

AFAIK, the existing open source applications that would be improved by this
system call are as follows:

* Through the Userspace RCU library (http://urcu.so)
- DNS server (Knot DNS) https://www.knot-dns.cz/
- Network sniffer (http://netsniff-ng.org/)
- Distributed object storage (https://sheepdog.github.io/sheepdog/)
- User-space tracing (http://lttng.org)
- Network storage system (https://www.gluster.org/)

Those projects use RCU in userspace to increase read-side speed and
scalability compared to locking. Especially when RCU is used by
libraries, sys_membarrier can speed up the read side by moving the
bulk of the memory barrier cost to synchronize_rcu().

* Direct users of sys_membarrier
- .NET Core (CoreCLR) garbage collector (https://github.com/dotnet/coreclr/issues/198)

Microsoft's .NET Core (CoreCLR) GC developers are planning to use a side
effect of mprotect(), namely that it issues memory barriers through IPIs, as a
way to implement the Windows FlushProcessWriteBuffers() API on Linux. In their
GitHub thread they refer to sys_membarrier, stating specifically that
sys_membarrier() is what they are looking for.
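The asymmetric pattern these users rely on can be sketched in C roughly as follows. This is a minimal illustration under stated assumptions, not liburcu's or CoreCLR's implementation: it supports a single reader thread, uses a GCC/Clang inline-asm compiler barrier in place of the kernel's barrier(), and every name (reader_begin(), reader_end(), writer_replace(), detect_membarrier()) is hypothetical. When MEMBARRIER_CMD_SHARED is available, the read side pays only for a compiler barrier and the write side calls membarrier(); otherwise both sides fall back to a full fence, so every barrier pair stays ordered according to the man page's ordering table.

```c
#define _GNU_SOURCE
#include <linux/membarrier.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <sys/syscall.h>
#include <unistd.h>

/* GCC/Clang compiler barrier: user-space equivalent of the kernel's barrier(). */
#define compiler_barrier() __asm__ __volatile__("" ::: "memory")

static bool have_membarrier;       /* set once, before any reader thread starts */
static _Atomic int reader_active;  /* hypothetical: one registered reader */
static _Atomic(int *) shared_ptr;  /* the RCU-protected pointer */

/* glibc provides no membarrier() wrapper, so invoke the raw system call. */
static int membarrier(int cmd, int flags)
{
    return (int)syscall(__NR_membarrier, cmd, flags);
}

static void detect_membarrier(void)
{
    int mask = membarrier(MEMBARRIER_CMD_QUERY, 0);
    have_membarrier = mask >= 0 && (mask & MEMBARRIER_CMD_SHARED);
}

/* Read side: only a compiler barrier when sys_membarrier() is usable. */
static inline void reader_fence(void)
{
    if (have_membarrier)
        compiler_barrier();
    else
        atomic_thread_fence(memory_order_seq_cst);  /* smp_mb() fallback */
}

/* Write side: pays for the barrier on behalf of every reader. */
static void writer_fence(void)
{
    if (have_membarrier)
        (void)membarrier(MEMBARRIER_CMD_SHARED, 0);  /* documented to return 0 */
    else
        atomic_thread_fence(memory_order_seq_cst);
}

static int *reader_begin(void)
{
    atomic_store_explicit(&reader_active, 1, memory_order_relaxed);
    reader_fence();  /* order the flag store before the pointer load */
    return atomic_load_explicit(&shared_ptr, memory_order_acquire);
}

static void reader_end(void)
{
    reader_fence();  /* order reads of the object before the flag clear */
    atomic_store_explicit(&reader_active, 0, memory_order_relaxed);
}

/* Publish newp; return the old pointer once the reader can no longer use it. */
static int *writer_replace(int *newp)
{
    int *old = atomic_exchange_explicit(&shared_ptr, newp, memory_order_release);
    writer_fence();  /* pairs with reader_begin(): reader sees newp, or we see its flag */
    while (atomic_load_explicit(&reader_active, memory_order_relaxed))
        ;            /* simplistic wait for the reader to leave */
    writer_fence();  /* pairs with reader_end(): its reads of old are complete */
    return old;
}
```

The writer's busy-wait only keeps the sketch short; a real RCU implementation would track grace periods rather than spin on a single per-reader flag.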

>
>>
>> membarrier(2) man page:
>> --------------- snip -------------------
>> MEMBARRIER(2) Linux Programmer's Manual MEMBARRIER(2)
>>
>> NAME
>> membarrier - issue memory barriers on a set of threads
>>
>> SYNOPSIS
>> #include <linux/membarrier.h>
>>
>> int membarrier(int cmd, int flags);
>>
>> DESCRIPTION
>> The cmd argument is one of the following:
>>
>> MEMBARRIER_CMD_QUERY
>> Query the set of supported commands. It returns a bitmask of
>> supported commands.
>>
>> MEMBARRIER_CMD_SHARED
>> Execute a memory barrier on all threads running on the system.
>> Upon return from the system call, the calling thread is guaranteed that
>> all running threads have passed through a state where all memory
>> accesses to user-space addresses match program order between
>> entry to and return from the system call (non-running threads
>> are de facto in such a state). This covers threads from all
>> processes running on the system. This command returns 0.
>>
>> The flags argument must be 0; it is reserved for future extensions.
>>
>> All memory accesses performed in program order from each targeted
>> thread are guaranteed to be ordered with respect to sys_membarrier(). If
>> we use "barrier()" to represent a compiler barrier forcing
>> memory accesses to be performed in program order across the barrier,
>> and smp_mb() to represent explicit memory barriers forcing full memory
>> ordering across the barrier, we have the following ordering table for
>> each pair of barrier(), sys_membarrier() and smp_mb():
>>
>> The pair ordering is detailed as (O: ordered, X: not ordered):
>>
>>                     barrier()   smp_mb()   sys_membarrier()
>> barrier()               X          X              O
>> smp_mb()                X          O              O
>> sys_membarrier()        O          O              O
>>
>> RETURN VALUE
>> On success, these system calls return zero. On error, -1 is returned,
>> and errno is set appropriately. For a given command, with flags
>> argument set to 0, this system call is guaranteed to always return the
>> same value until reboot.
>
> I suggest "with flags argument set to MEMBARRIER_CMD_QUERY" here.

No, the enum is for the "cmd" argument (see above), not the flags argument. We
really mean flags = 0 (the value) here.
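A small C sketch of the calling convention being clarified here: MEMBARRIER_CMD_QUERY goes in cmd, and flags is the literal value 0. Since glibc provides no membarrier() wrapper, the sketch defines a thin one matching the SYNOPSIS above on top of syscall(2); the helper names membarrier_query() and membarrier_supports() are hypothetical.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <linux/membarrier.h>
#include <stdbool.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Thin wrapper matching "int membarrier(int cmd, int flags);". */
static int membarrier(int cmd, int flags)
{
    return (int)syscall(__NR_membarrier, cmd, flags);
}

/* The command goes in cmd; flags is the value 0, not an enum constant. */
static int membarrier_query(void)
{
    return membarrier(MEMBARRIER_CMD_QUERY, 0);
}

/* True if the running kernel advertises cmd in the QUERY bitmask. */
static bool membarrier_supports(int cmd)
{
    int mask = membarrier_query();
    return mask >= 0 && (mask & cmd) != 0;
}
```

Because the result for a given command is stable until reboot, a program can run the query once at startup and cache the answer. A non-zero flags value is rejected with EINVAL by the quoted handler, which is what leaves room to give flags a meaning later.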

>
>>
>> ERRORS
>> ENOSYS System call is not implemented.
>>
>> EINVAL Invalid arguments.
>>
>> ...
>>
>> +SYSCALL_DEFINE2(membarrier, int, cmd, int, flags)
>> +{
>> + if (flags)
>> + return -EINVAL;
>
> I'm not a huge fan of this "add a flags arg to syscalls" rule. Is
> there any realistic expectation that we'll ever *use* this thing? If
> not, why add it?

I can see this system call evolving in a few ways in the future, such as
having an expedited version (using IPIs), targeting the local thread
group, and targeting all threads that map a specific shared memory region.
I guess that the cmd argument should be enough to cover that, but
when in doubt, it might be better to keep a flags argument there for future
needs we might be overlooking right now, so we never end up needing a
sys_membarrier2 system call.

>
> You may as well put an unlikely() in there btw.

Will do.

Thanks!

Mathieu

>
>> + switch (cmd) {
>> + case MEMBARRIER_CMD_QUERY:
>> + return MEMBARRIER_CMD_BITMASK;
>> + case MEMBARRIER_CMD_SHARED:
>> + if (num_online_cpus() > 1)
>> + synchronize_sched();
>> + return 0;
>> + default:
>> + return -EINVAL;
>> + }
>> +}
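To exercise the handler's control flow outside the kernel, here is a user-space model of the quoted SYSCALL_DEFINE2 body with the unlikely() annotation Andrew suggested. Only the control flow is taken from the patch; online_cpus and fake_synchronize_sched() are hypothetical stand-ins for num_online_cpus() and synchronize_sched(), and MODEL_CMD_BITMASK assumes the elided MEMBARRIER_CMD_BITMASK contains only MEMBARRIER_CMD_SHARED. The unlikely() macro is the plain definition from the kernel's <linux/compiler.h> (ignoring its branch-profiling variant); it only hints branch layout and does not change behavior. As in kernel code, errors are returned as negative errno values; the libc syscall() wrapper is what turns them into -1 and errno.

```c
#include <errno.h>
#include <linux/membarrier.h>

/* Branch-layout hint only, as in the kernel's <linux/compiler.h>. */
#define unlikely(x) __builtin_expect(!!(x), 0)

/* Assumption: the elided MEMBARRIER_CMD_BITMASK holds only the SHARED command. */
#define MODEL_CMD_BITMASK MEMBARRIER_CMD_SHARED

/* Hypothetical stand-ins for num_online_cpus() and synchronize_sched(). */
static int online_cpus = 4;
static int grace_periods;

static void fake_synchronize_sched(void)
{
    grace_periods++;  /* the real call waits for every CPU to pass a quiescent state */
}

/* User-space model of SYSCALL_DEFINE2(membarrier, int, cmd, int, flags). */
static int model_membarrier(int cmd, int flags)
{
    if (unlikely(flags))  /* reserved for future extensions: only 0 is valid */
        return -EINVAL;
    switch (cmd) {
    case MEMBARRIER_CMD_QUERY:
        return MODEL_CMD_BITMASK;
    case MEMBARRIER_CMD_SHARED:
        /* With one CPU only the caller is running; every other thread is
         * de facto in the required state, so no grace period is needed. */
        if (online_cpus > 1)
            fake_synchronize_sched();
        return 0;
    default:
        return -EINVAL;
    }
}
```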

--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com