Re: [PATCH] introduce sys_membarrier(): process-wide memory barrier (v10)

From: Randy Dunlap
Date: Mon Apr 05 2010 - 14:40:46 EST


On Mon, 5 Apr 2010 13:57:37 -0400 Mathieu Desnoyers wrote:

> [ The last RFC submission got no reply, so I am submitting for real now, hoping
> to hear something other than the sound of "crickets" (yeah, it's unusually warm in
> Montreal for this time of the year). The only objections that remain are style
> issues pointed out by Ingo. These are answered below. ]
>

more chirping.

...

> This patch applies on top of -tip based on 2.6.34-rc3.
>
> Here is an implementation of a new system call, sys_membarrier(), which
> executes a memory barrier on all threads of the current process. It can be used
> to distribute the cost of user-space memory barriers asymmetrically by
> transforming pairs of memory barriers into pairs consisting of sys_membarrier()
> and a compiler barrier. For synchronization primitives that distinguish between
> read-side and write-side (e.g. userspace RCU, rwlocks), the read-side can be
> accelerated significantly by moving the bulk of the memory barrier overhead to
> the write-side.
>
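
For context, the read/write split described above looks roughly like this from
userspace (a sketch with hypothetical helper names, not code from the patch; it
assumes the new syscall number and the <linux/membarrier.h> flags exported by
this patch are visible to userspace):

	#define _GNU_SOURCE
	#include <unistd.h>
	#include <sys/syscall.h>
	#include <linux/membarrier.h>	/* header added by this patch */

	#define barrier()	__asm__ __volatile__("" ::: "memory")

	/* read-side: the former smp_mb() becomes a compiler barrier only */
	static inline void reader_barrier(void)
	{
		barrier();
	}

	/* write-side: absorbs the cost by forcing a memory barrier on all
	 * running threads of the current process */
	static inline void writer_barrier(void)
	{
		syscall(__NR_membarrier, MEMBARRIER_EXPEDITED);
	}
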
> The first user of this system call is the "liburcu" Userspace RCU implementation
> found at http://lttng.org/urcu. It aims at greatly simplifying and enhancing the
> current implementation, which uses a scheme similar to sys_membarrier(), but
> based on signals sent to each reader thread.

and what uses liburcu?

> This patch mostly sits in kernel/sched.c (it needs to access struct rq). It is
> based on kernel v2.6.34-rc3, and also applies fine to tip/master. I am
> proposing it for merge for 2.6.35. I think the -tip tree would be the right one
> to pick up this patch, as it touches sched.c.
>
...

>
> This patch only adds the system calls to x86 32/64. See the sys_membarrier()
> comments for the memory barrier requirements in switch_mm() needed to port it
> to other architectures.
>
> Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@xxxxxxxxxxxx>
> Acked-by: KOSAKI Motohiro <kosaki.motohiro@xxxxxxxxxxxxxx>
> Acked-by: Steven Rostedt <rostedt@xxxxxxxxxxx>
> Acked-by: Paul E. McKenney <paulmck@xxxxxxxxxxxxxxxxxx>
> CC: Nicholas Miell <nmiell@xxxxxxxxxxx>
> CC: Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx>
> CC: mingo@xxxxxxx
> CC: laijs@xxxxxxxxxxxxxx
> CC: dipankar@xxxxxxxxxx
> CC: akpm@xxxxxxxxxxxxxxxxxxxx
> CC: josh@xxxxxxxxxxxxxxxx
> CC: dvhltc@xxxxxxxxxx
> CC: niv@xxxxxxxxxx
> CC: tglx@xxxxxxxxxxxxx
> CC: peterz@xxxxxxxxxxxxx
> CC: Valdis.Kletnieks@xxxxxx
> CC: dhowells@xxxxxxxxxx
> CC: Nick Piggin <npiggin@xxxxxxx>
> CC: Chris Friesen <cfriesen@xxxxxxxxxx>
> ---
> arch/x86/ia32/ia32entry.S | 1
> arch/x86/include/asm/mmu_context.h | 28 ++++-
> arch/x86/include/asm/unistd_32.h | 3
> arch/x86/include/asm/unistd_64.h | 2
> arch/x86/kernel/syscall_table_32.S | 1
> include/linux/Kbuild | 1
> include/linux/membarrier.h | 47 ++++++++
> kernel/sched.c | 194 +++++++++++++++++++++++++++++++++++++
> 8 files changed, 274 insertions(+), 3 deletions(-)
>
> Index: linux.trees.git/arch/x86/include/asm/unistd_64.h
> ===================================================================
> --- linux.trees.git.orig/arch/x86/include/asm/unistd_64.h 2010-03-30 09:48:03.000000000 -0400
> +++ linux.trees.git/arch/x86/include/asm/unistd_64.h 2010-04-05 13:46:16.000000000 -0400
> @@ -663,6 +663,8 @@ __SYSCALL(__NR_rt_tgsigqueueinfo, sys_rt
> __SYSCALL(__NR_perf_event_open, sys_perf_event_open)
> #define __NR_recvmmsg 299
> __SYSCALL(__NR_recvmmsg, sys_recvmmsg)
> +#define __NR_membarrier 300
> +__SYSCALL(__NR_membarrier, sys_membarrier)
>
> #ifndef __NO_STUBS
> #define __ARCH_WANT_OLD_READDIR
> Index: linux.trees.git/kernel/sched.c
> ===================================================================
> --- linux.trees.git.orig/kernel/sched.c 2010-04-05 13:45:52.000000000 -0400
> +++ linux.trees.git/kernel/sched.c 2010-04-05 13:53:49.000000000 -0400
> @@ -71,6 +71,7 @@
> #include <linux/debugfs.h>
> #include <linux/ctype.h>
> #include <linux/ftrace.h>
> +#include <linux/membarrier.h>
>
> #include <asm/tlb.h>
> #include <asm/irq_regs.h>
> @@ -8951,6 +8952,199 @@ struct cgroup_subsys cpuacct_subsys = {
> };
> #endif /* CONFIG_CGROUP_CPUACCT */
>
> +#ifdef CONFIG_SMP
> +
> +/*
> + * Execute a memory barrier on all active threads from the current process
> + * on SMP systems. Do not rely on implicit barriers in IPI handler execution,
> + * because batched IPI lists are synchronized with spinlocks rather than full
> + * memory barriers. This is not the bulk of the overhead anyway, so let's stay
> + * on the safe side.
> + */
> +static void membarrier_ipi(void *unused)
> +{
> + smp_mb();
> +}
> +
> +/*
> + * Handle out-of-mem by sending per-cpu IPIs instead.
> + */
> +static void membarrier_retry(void)
> +{
> + struct mm_struct *mm;
> + int cpu;
> +
> + for_each_cpu(cpu, mm_cpumask(current->mm)) {
> + raw_spin_lock_irq(&cpu_rq(cpu)->lock);
> + mm = cpu_curr(cpu)->mm;
> + raw_spin_unlock_irq(&cpu_rq(cpu)->lock);
> + if (current->mm == mm)
> + smp_call_function_single(cpu, membarrier_ipi, NULL, 1);
> + }
> +}
> +
> +/*
> + * sys_membarrier - issue memory barrier on current process running threads
> + * @flags: One of these must be set:
> + * MEMBARRIER_EXPEDITED
> + * Adds some overhead, fast execution (few microseconds)
> + * MEMBARRIER_DELAYED
> + * Low overhead, but slow execution (few milliseconds)
> + *
> + * MEMBARRIER_QUERY
> + * This optional flag can be set to query if the kernel supports
> + * a set of flags.
> + *
> + * return values: Returns -EINVAL if the flags are incorrect. Testing for kernel
> + * sys_membarrier support can be done by checking for -ENOSYS return value.
> + * Return values >= 0 indicate success. For a given set of flags on a given
> + * kernel, this system call will always return the same value. It is therefore
> + * correct to check the return value only once at library load, passing the

library load assumes caller is a library? does the kernel care about that?


> + * MEMBARRIER_QUERY flag in addition, to check only whether the flags are
> + * supported, without performing any synchronization.
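
A caller (library or not) would typically do that check once up front; a
minimal sketch, assuming the usual syscall(2) wrapper and the flag definitions
from this patch:

	static int have_expedited_membarrier;

	static void membarrier_detect(void)
	{
		int ret;

		/* MEMBARRIER_QUERY: validate the flags, do no work */
		ret = syscall(__NR_membarrier,
			      MEMBARRIER_EXPEDITED | MEMBARRIER_QUERY);
		/* -1/ENOSYS (no syscall) and -1/EINVAL (unknown flag) both
		 * mean "fall back to the signal-based scheme" */
		have_expedited_membarrier = (ret >= 0);
	}
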
> + *
> + * This system call executes a memory barrier on all running threads of the
> + * current process. Upon completion, the caller thread is ensured that all
> + * process threads have passed through a state where all memory accesses to
> + * user-space addresses match program order. (non-running threads are de facto
> + * in such a state)

state.)

> + *
> + * Using the non-expedited mode is recommended for applications which can
> + * afford leaving the caller thread waiting for a few milliseconds. A good
> + * example would be a thread dedicated to executing RCU callbacks, which waits
> + * for callbacks to be enqueued most of the time anyway.
> + *
> + * The expedited mode is recommended whenever the application needs to have
> + * control returning to the caller thread as quickly as possible. An example
> + * of such an application would be one which uses the same thread to perform
> + * data structure updates and issue the RCU synchronization.
> + *
> + * It is perfectly safe to call both expedited and non-expedited
> + * sys_membarrier() in a process.
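
That is, the choice is just a per-call-site flag (illustrative only):

	/* updater thread that can sleep a few milliseconds: cheapest
	 * for the rest of the system */
	syscall(__NR_membarrier, MEMBARRIER_DELAYED);

	/* latency-sensitive updater: returns in microseconds, at the
	 * cost of sending IPIs */
	syscall(__NR_membarrier, MEMBARRIER_EXPEDITED);
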
> + *
> + * mm_cpumask is used as an approximation of the processors which run threads
> + * belonging to the current process. It is a superset of the cpumask to which we
> + * must send IPIs, mainly due to lazy TLB shootdown. Therefore, for each CPU in
> + * the mm_cpumask, we check each runqueue with the rq lock held to make sure our
> + * ->mm is indeed running on them. The rq lock ensures that a memory barrier is
> + * issued each time the rq current task is changed. This reduces the risk of
> + * disturbing a RT task by sending unnecessary IPIs. There is still a slight
> + * chance to disturb an unrelated task, because we do not lock the runqueues
> + * while sending IPIs, but the real-time effect of this heavy locking would be
> + * worse than the comparatively small disruption of an IPI.
> + *
> + * RED PEN: before assinging a system call number for sys_membarrier() to an

assigning

> + * architecture, we must ensure that switch_mm issues full memory barriers
> + * (or a synchronizing instruction having the same effect) between:
> + * - memory accesses to user-space addresses and clear mm_cpumask.
> + * - set mm_cpumask and memory accesses to user-space addresses.
> + *
> + * The reason why these memory barriers are required is that mm_cpumask updates,
> + * as well as iteration on the mm_cpumask, offer no ordering guarantees.
> + * These added memory barriers ensure that any thread modifying the mm_cpumask
> + * is in a state where all memory accesses to user-space addresses are
> + * guaranteed to be in program order.
> + *
> + * In some cases adding a comment to this effect will suffice; in others we
> + * will need to add smp_mb__before_clear_bit()/smp_mb__after_clear_bit() or
> + * simply smp_mb(). These barriers are required to ensure we do not _miss_ a
> + * CPU that needs to receive an IPI, which would be a bug.
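
If I read this right, what is being asked of an architecture port is something
along these lines inside switch_mm() (purely a sketch; the real x86 change is
the asm/mmu_context.h hunk in the diffstat, where the cr3 load already acts as
a serializing instruction):

	/* order prior user-space memory accesses before removing this
	 * CPU from the old mm's mask */
	smp_mb__before_clear_bit();
	cpumask_clear_cpu(cpu, mm_cpumask(prev));

	/* ... switch page tables ... */

	cpumask_set_cpu(cpu, mm_cpumask(next));
	/* order the mask update before user-space memory accesses done
	 * with the new mm (unless the mm switch already implies a full
	 * barrier on this architecture) */
	smp_mb();
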
> + *
> + * On uniprocessor systems, this system call simply returns 0 without doing
> + * anything, so user-space knows it is implemented.
> + *
> + * The flags argument has room for extensibility, with 16 lower bits holding
> + * mandatory flags for which older kernels will fail if they encounter an
> + * unknown flag. The high 16 bits are used for optional flags, which older
> + * kernels don't have to care about.
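
So roughly this layout (the authoritative definitions are in the
include/linux/membarrier.h hunk, which is not quoted here; the values below
are only illustrative):

	#define MEMBARRIER_MANDATORY_MASK	0x0000ffff	/* unknown bit -> -EINVAL */
	#define MEMBARRIER_OPTIONAL_MASK	0xffff0000	/* unknown bit is ignored */

	#define MEMBARRIER_EXPEDITED		(1 << 0)	/* mandatory */
	#define MEMBARRIER_DELAYED		(1 << 1)	/* mandatory */
	#define MEMBARRIER_QUERY		(1 << 16)	/* optional */
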
> + *
> + * This synchronization only takes care of threads using the current process
> + * memory map. It should not be used to synchronize accesses performed on memory
> + * maps shared between different processes.
> + */
> +SYSCALL_DEFINE1(membarrier, unsigned int, flags)
> +{
> + struct mm_struct *mm;
> + cpumask_var_t tmpmask;
> + int cpu;
> +
> + /*
> + * Expect _only_ one of expedited or delayed flags.
> + * Don't care about optional mask for now.
> + */
> + switch (flags & MEMBARRIER_MANDATORY_MASK) {
> + case MEMBARRIER_EXPEDITED:
> + case MEMBARRIER_DELAYED:
> + break;
> + default:
> + return -EINVAL;
> + }
> + if (unlikely(flags & MEMBARRIER_QUERY
> + || thread_group_empty(current))
> + || num_online_cpus() == 1)
> + return 0;
> + if (flags & MEMBARRIER_DELAYED) {
> + synchronize_sched();
> + return 0;
> + }
> + /*
> + * Memory barrier on the caller thread between previous memory accesses
> + * to user-space addresses and sending memory-barrier IPIs. Orders all
> + * user-space address memory accesses prior to sys_membarrier() before
> + * mm_cpumask read and membarrier_ipi executions. This barrier is paired
> + * with memory barriers in:
> + * - membarrier_ipi() (for each running thread of the current process)
> + * - switch_mm() (ordering scheduler mm_cpumask update wrt memory
> + * accesses to user-space addresses)
> + * - Each CPU ->mm update performed with rq lock held by the scheduler.
> + * A memory barrier is issued each time ->mm is changed while the rq
> + * lock is held.
> + */
> + smp_mb();
> + if (!alloc_cpumask_var(&tmpmask, GFP_NOWAIT)) {
> + membarrier_retry();
> + goto out;
> + }
> + cpumask_copy(tmpmask, mm_cpumask(current->mm));
> + preempt_disable();
> + cpumask_clear_cpu(smp_processor_id(), tmpmask);
> + for_each_cpu(cpu, tmpmask) {
> + raw_spin_lock_irq(&cpu_rq(cpu)->lock);
> + mm = cpu_curr(cpu)->mm;
> + raw_spin_unlock_irq(&cpu_rq(cpu)->lock);
> + if (current->mm != mm)
> + cpumask_clear_cpu(cpu, tmpmask);
> + }
> + smp_call_function_many(tmpmask, membarrier_ipi, NULL, 1);
> + preempt_enable();
> + free_cpumask_var(tmpmask);
> +out:
> + /*
> + * Memory barrier on the caller thread between sending&waiting for

sending & waiting

> + * memory-barrier IPIs and following memory accesses to user-space
> + * addresses. Orders mm_cpumask read and membarrier_ipi executions
> + * before all user-space address memory accesses following
> + * sys_membarrier(). This barrier is paired with memory barriers in:
> + * - membarrier_ipi() (for each running thread of the current process)
> + * - switch_mm() (ordering scheduler mm_cpumask update wrt memory
> + * accesses to user-space addresses)
> + * - Each CPU ->mm update performed with rq lock held by the scheduler.
> + * A memory barrier is issued each time ->mm is changed while the rq
> + * lock is held.
> + */
> + smp_mb();
> + return 0;
> +}
> +
> +#else /* #ifdef CONFIG_SMP */

I don't know that we have a known convention for that, but I would use:

#else /* not CONFIG_SMP */

or

#else /* !CONFIG_SMP */

> +
> +SYSCALL_DEFINE1(membarrier, unsigned int, flags)
> +{
> + return 0;
> +}
> +
> +#endif /* #else #ifdef CONFIG_SMP */

and:

#endif /* CONFIG_SMP */

The "#else #ifdef" is both ugly and too wordy IMO.

> +
> #ifndef CONFIG_SMP
>
> int rcu_expedited_torture_stats(char *page)


---
~Randy
