Re: [PATCH v2 4/4] membarrier: Execute SYNC_CORE on the calling thread

From: Mathieu Desnoyers
Date: Wed Dec 02 2020 - 14:40:45 EST


----- On Dec 2, 2020, at 10:35 AM, Andy Lutomirski luto@xxxxxxxxxx wrote:

> membarrier()'s MEMBARRIER_CMD_PRIVATE_EXPEDITED_SYNC_CORE is documented
> as syncing the core on all sibling threads but not necessarily the
> calling thread. This behavior is fundamentally buggy and cannot be used
> safely. Suppose a user program has two threads. Thread A is on CPU 0
> and thread B is on CPU 1. Thread A modifies some text and calls
> membarrier(MEMBARRIER_CMD_PRIVATE_EXPEDITED_SYNC_CORE). Then thread B
> executes the modified code. If, at any point after membarrier() decides
> which CPUs to target, thread A could be preempted and replaced by thread
> B on CPU 0. This could even happen on exit from the membarrier()
> syscall. If this happens, thread B will end up running on CPU 0 without
> having synced.

Indeed, good catch! We only have sync core in the scheduler when switching
between mm, so indeed we cannot rely on the scheduler to issue a sync core
for us when switching between threads with the same mm.

> In principle, this could be fixed by arranging for the scheduler to
> sync_core_before_usermode() whenever switching between two threads in
> the same mm if there is any possibility of a concurrent membarrier()
> call, but this would have considerable overhead. Instead, make
> membarrier() sync the calling CPU as well.

Yes, I agree that sync core on self is the right approach here.

> As an optimization, this avoids an extra smp_mb() in the default
> barrier-only mode.
>
> Cc: stable@xxxxxxxxxxxxxxx
> Signed-off-by: Andy Lutomirski <luto@xxxxxxxxxx>
> ---
> kernel/sched/membarrier.c | 49 +++++++++++++++++++++++++--------------
> 1 file changed, 32 insertions(+), 17 deletions(-)
>
> diff --git a/kernel/sched/membarrier.c b/kernel/sched/membarrier.c
> index 01538b31f27e..7df7c0e60647 100644
> --- a/kernel/sched/membarrier.c
> +++ b/kernel/sched/membarrier.c
> @@ -352,8 +352,6 @@ static int membarrier_private_expedited(int flags, int
> cpu_id)
>

There is one small optimization you will want to adapt here:

if (atomic_read(&mm->mm_users) == 1 || num_online_cpus() == 1)
return 0;

should become:

if (flags != MEMBARRIER_FLAG_SYNC_CORE && atomic_read(&mm->mm_users) == 1 || num_online_cpus() == 1)
return 0;

So we issue a core sync for self in single-threaded applications,
to make things consistent. We can then document that membarrier sync core
issues a core sync on all thread siblings including the caller thread.

> if (cpu_id >= nr_cpu_ids || !cpu_online(cpu_id))
> goto out;
> - if (cpu_id == raw_smp_processor_id())
> - goto out;
> rcu_read_lock();
> p = rcu_dereference(cpu_rq(cpu_id)->curr);
> if (!p || p->mm != mm) {
> @@ -368,16 +366,6 @@ static int membarrier_private_expedited(int flags, int
> cpu_id)
> for_each_online_cpu(cpu) {
> struct task_struct *p;
>
> - /*
> - * Skipping the current CPU is OK even through we can be
> - * migrated at any point. The current CPU, at the point
> - * where we read raw_smp_processor_id(), is ensured to
> - * be in program order with respect to the caller
> - * thread. Therefore, we can skip this CPU from the
> - * iteration.
> - */
> - if (cpu == raw_smp_processor_id())
> - continue;
> p = rcu_dereference(cpu_rq(cpu)->curr);
> if (p && p->mm == mm)
> __cpumask_set_cpu(cpu, tmpmask);
> @@ -385,12 +373,39 @@ static int membarrier_private_expedited(int flags, int
> cpu_id)
> rcu_read_unlock();
> }
>
> - preempt_disable();
> - if (cpu_id >= 0)
> + if (cpu_id >= 0) {
> + /*
> + * smp_call_function_single() will call ipi_func() if cpu_id
> + * is the calling CPU.
> + */
> smp_call_function_single(cpu_id, ipi_func, NULL, 1);
> - else
> - smp_call_function_many(tmpmask, ipi_func, NULL, 1);
> - preempt_enable();
> + } else {
> + /*
> + * For regular membarrier, we can save a few cycles by
> + * skipping the current cpu -- we're about to do smp_mb()
> + * below, and if we migrate to a different cpu, this cpu
> + * and the new cpu will execute a full barrier in the
> + * scheduler.
> + *
> + * For CORE_SYNC, we do need a barrier on the current cpu --
> + * otherwise, if we are migrated and replaced by a different
> + * task in the same mm just before, during, or after
> + * membarrier, we will end up with some thread in the mm
> + * running without a core sync.
> + *
> + * For RSEQ, it seems polite to target the calling thread
> + * as well, although it's not clear it makes much difference
> + * either way. Users aren't supposed to run syscalls in an
> + * rseq critical section.

Considering that we want a consistent behavior between single and multi-threaded
programs (as I pointed out above wrt the optimization change), I think it would
be better to skip self for the rseq ipi, in the same way we'd want to return
early for a membarrier-rseq-private on a single-threaded mm. Users are _really_
not supposed to run system calls in rseq critical sections. The kernel even kills
the offending applications when run on kernels with CONFIG_DEBUG_RSEQ=y. So it seems
rather pointless to waste cycles doing a rseq fence on self considering this.

Thanks,

Mathieu

> + */
> + if (ipi_func == ipi_mb) {
> + preempt_disable();
> + smp_call_function_many(tmpmask, ipi_func, NULL, true);
> + preempt_enable();
> + } else {
> + on_each_cpu_mask(tmpmask, ipi_func, NULL, true);
> + }
> + }
>
> out:
> if (cpu_id < 0)
> --
> 2.28.0

--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com