Re: Memory-ordering recipes

From: Andrea Parri
Date: Tue Oct 17 2017 - 21:08:04 EST


On Tue, Oct 17, 2017 at 02:01:37PM -0700, Paul E. McKenney wrote:
> On Sun, Sep 17, 2017 at 04:05:09PM -0700, Paul E. McKenney wrote:
> > Hello!
> >
> > The topic of memory-ordering recipes came up at the Linux Plumbers
> > Conference microconference on Friday, so I thought that I should summarize
> > what is currently "out there":
>
> And here is an updated list of potential Linux-kernel examples for a
> "recipes" document, and thank you for the feedback. Please feel free
> to counterpropose better examples. In addition, if there is some other
> pattern that is commonplace and important enough to be included in a
> recipes document, please point it out.
>
> Thanx, Paul
>
> ------------------------------------------------------------------------
>
> This document lists the litmus-test patterns that we have been discussing,
> along with examples from the Linux kernel. This is intended to feed into
> the recipes document. All examples are from v4.13.
>
> 0. Simple special cases
>
> If there is only one CPU on the one hand or only one variable
> on the other, the code will execute in order. There are (as
> usual) some things to be careful of:
>
> a. There are some aspects of the C language that are
> unordered. For example, in the expression "f(x) + g(y)",
> the order in which f and g are called is not defined;
> the object code is allowed to use either order or even
> to interleave the computations.
>
> b. Compilers are permitted to use the "as-if" rule. That is,
> a compiler can emit whatever code it likes, as long as
> the results of a single-threaded execution appear just
> as if the compiler had followed all the relevant rules.
> To see this, compile with a high level of optimization
> and run the debugger on the resulting binary.
>
> c. If there is only one variable but multiple CPUs, that
> variable must be properly aligned and all accesses to it
> must be full sized. Variables that straddle cachelines
> or pages void your full-ordering warranty, as do undersized
> accesses that load from or store to only part of the
> variable. (See the litmus-style sketch just after this list.)
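>
> For illustration, here is a litmus-style sketch of the
> single-variable guarantee (a hypothetical test, in the same
> format as the litmus test quoted near the end of this message
> and intended for a tool such as herd): once a reader has seen
> the new value of a properly aligned, full-sized, marked
> variable, it cannot subsequently see the old value.
>
> C single-variable-sketch
>
> {
> }
>
> P0(int *x)
> {
>         WRITE_ONCE(*x, 1); /* Full-sized, marked store. */
> }
>
> P1(int *x)
> {
>         int r1;
>         int r2;
>
>         r1 = READ_ONCE(*x); /* Full-sized, marked loads. */
>         r2 = READ_ONCE(*x);
> }
>
> exists
> (1:r1=1 /\ 1:r2=0)
>
> The "exists" clause, in which the reader sees the new value
> and then the old one, should never be satisfied.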
>
> 1. Another simple case: Locking. [ Assuming you don't think too
> hard about it, that is! ]
>
> Any CPU that has acquired a given lock sees all changes previously
> made by any other CPU prior to that CPU having released the same lock.
>
> [ Should we discuss chaining back through different locks,
> sort of like release-acquire chains? ]
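>
> As a (hypothetical) litmus-style sketch of this guarantee,
> assuming spinlock support of the kind provided by tools such
> as herd: if P1's critical section follows P0's (as witnessed
> by the flag f), then P1 must also see P0's store to x.
>
> C lock-MP-sketch
>
> {
> }
>
> P0(spinlock_t *s, int *x, int *f)
> {
>         spin_lock(s);
>         WRITE_ONCE(*x, 1);
>         WRITE_ONCE(*f, 1); /* Record that P0's critical section ran. */
>         spin_unlock(s);
> }
>
> P1(spinlock_t *s, int *x, int *f)
> {
>         int r0;
>         int r1;
>
>         spin_lock(s);
>         r0 = READ_ONCE(*f); /* Did P0's critical section precede ours? */
>         r1 = READ_ONCE(*x);
>         spin_unlock(s);
> }
>
> exists
> (1:r0=1 /\ 1:r1=0)
>
> Seeing the flag while missing the store to x should be impossible.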
>
> 2. MP (see test6.pdf for nickname translation)
>
> a. smp_store_release() / smp_load_acquire()
>
> init_stack_slab() in lib/stackdepot.c uses release-acquire
> to handle initialization of a slab of the stack. Working
> out the mutual-exclusion design is left as an exercise for
> the reader.
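>
> Distilled to a litmus test (hypothetical variable names, not
> the actual stackdepot code), the underlying pattern is MP with
> a release-acquire pairing:
>
> C MP-rel-acq-sketch
>
> {
> }
>
> P0(int *data, int *flag)
> {
>         WRITE_ONCE(*data, 1);       /* Initialize the data... */
>         smp_store_release(flag, 1); /* ...then publish it. */
> }
>
> P1(int *data, int *flag)
> {
>         int r1;
>         int r2;
>
>         r1 = smp_load_acquire(flag); /* Did P0 publish? */
>         r2 = READ_ONCE(*data);
> }
>
> exists
> (1:r1=1 /\ 1:r2=0)
>
> The forbidden outcome is the one in which P1 sees the flag set
> but misses the data.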
>
> b. rcu_assign_pointer() / rcu_dereference()
>
> expand_to_next_prime() does the rcu_assign_pointer(),
> and next_prime_number() does the rcu_dereference().
> This mediates access to a bit vector that is expanded
> as additional primes are needed. These two functions
> are in lib/prime_numbers.c.
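>
> A minimal sketch of this sort of pointer publication (with a
> made-up structure, not the actual prime_numbers.c code) might
> look as follows; rcu_assign_pointer() orders the initialization
> before the publication, and rcu_dereference() orders the
> dereferences after the pointer load:
>
>         struct foo {
>                 int a;
>         };
>         struct foo __rcu *gp; /* NULL until published. */
>
>         void publish(struct foo *p)
>         {
>                 p->a = 42;                 /* Initialize... */
>                 rcu_assign_pointer(gp, p); /* ...then publish. */
>         }
>
>         int consume(void)
>         {
>                 struct foo *p;
>                 int ret = -1;
>
>                 rcu_read_lock();
>                 p = rcu_dereference(gp); /* Dependency-ordered load. */
>                 if (p)
>                         ret = p->a;      /* Guaranteed to see 42. */
>                 rcu_read_unlock();
>                 return ret;
>         }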
>
> c. smp_wmb() / smp_rmb()
>
> xlog_state_switch_iclogs() contains the following:
>
> log->l_curr_block -= log->l_logBBsize;
> ASSERT(log->l_curr_block >= 0);
> smp_wmb();
> log->l_curr_cycle++;
>
> And xlog_valid_lsn() contains the following:
>
> cur_cycle = ACCESS_ONCE(log->l_curr_cycle);
> smp_rmb();
> cur_block = ACCESS_ONCE(log->l_curr_block);
>
> Alternatively, from the comment in perf_output_put_handle()
> in kernel/events/ring_buffer.c:
>
> *   kernel                             user
> *
> *   if (LOAD ->data_tail) {            LOAD ->data_head
> *                   (A)                smp_rmb()       (C)
> *      STORE $data                     LOAD $data
> *      smp_wmb()    (B)                smp_mb()        (D)
> *      STORE ->data_head               STORE ->data_tail
> *   }
> *
> * Where A pairs with D, and B pairs with C.
>
> The B/C pairing is MP with smp_wmb() and smp_rmb().
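>
> Distilled to a litmus test (with hypothetical names echoing
> the xlog fields above), the B/C pairing is:
>
> C MP-wmb-rmb-sketch
>
> {
> }
>
> P0(int *block, int *cycle)
> {
>         WRITE_ONCE(*block, 1);
>         smp_wmb();
>         WRITE_ONCE(*cycle, 1);
> }
>
> P1(int *block, int *cycle)
> {
>         int r1;
>         int r2;
>
>         r1 = READ_ONCE(*cycle);
>         smp_rmb();
>         r2 = READ_ONCE(*block);
> }
>
> exists
> (1:r1=1 /\ 1:r2=0)
>
> A reader that sees the new cycle must also see the new block.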
>
> d. Replacing either of the above with smp_mb()
>
> Holding off on this one for the moment...
>
> 3. LB
>
> a. LB+ctrl+mb
>
> Again, from the comment in perf_output_put_handle()
> in kernel/events/ring_buffer.c:
>
> *   kernel                             user
> *
> *   if (LOAD ->data_tail) {            LOAD ->data_head
> *                   (A)                smp_rmb()       (C)
> *      STORE $data                     LOAD $data
> *      smp_wmb()    (B)                smp_mb()        (D)
> *      STORE ->data_head               STORE ->data_tail
> *   }
> *
> * Where A pairs with D, and B pairs with C.
>
> The A/D pairing covers this one.
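>
> A stripped-down (hypothetical) litmus sketch of the A/D
> pairing, with a control dependency on one side and a full
> barrier on the other, might look like:
>
> C LB-ctrl-mb-sketch
>
> {
> }
>
> P0(int *head, int *tail)
> {
>         int r0;
>
>         r0 = READ_ONCE(*tail);        /* A */
>         if (r0)
>                 WRITE_ONCE(*head, 1); /* Store controlled by the load. */
> }
>
> P1(int *head, int *tail)
> {
>         int r1;
>
>         r1 = READ_ONCE(*head);
>         smp_mb();                     /* D */
>         WRITE_ONCE(*tail, 1);
> }
>
> exists
> (0:r0=1 /\ 1:r1=1)
>
> Each thread's write is ordered after its read, so the cycle in
> which each read sees the other thread's write is forbidden.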
>
> 4. Release-acquire chains, AKA ISA2, Z6.2, LB, and 3.LB
>
> There is a lot of variety here; in some cases one can substitute:
>
> a. READ_ONCE() for smp_load_acquire()
> b. WRITE_ONCE() for smp_store_release()
> c. Dependencies for both smp_load_acquire() and
> smp_store_release().
> d. smp_wmb() for smp_store_release() in first thread
> of ISA2 and Z6.2.
> e. smp_rmb() for smp_load_acquire() in last thread of ISA2.
>
> The canonical illustration of LB involves the various memory
> allocators, where you don't want a load from about-to-be-freed
> memory to see a store initializing a later incarnation of that
> same memory area. But the per-CPU caches make this a very
> long and complicated example.
>
> I am not aware of any three-CPU release-acquire chains in the
> Linux kernel. There are three-CPU lock-based chains in RCU,
> but these are not at all simple, either.
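>
> For reference, a bare-bones ISA2-style release-acquire chain
> (a hypothetical sketch, not drawn from any particular kernel
> code) might look like:
>
> C ISA2-rel-acq-sketch
>
> {
> }
>
> P0(int *x, int *y)
> {
>         WRITE_ONCE(*x, 1);
>         smp_store_release(y, 1);
> }
>
> P1(int *y, int *z)
> {
>         int r1;
>
>         r1 = smp_load_acquire(y);
>         smp_store_release(z, 1);
> }
>
> P2(int *x, int *z)
> {
>         int r2;
>         int r3;
>
>         r2 = smp_load_acquire(z);
>         r3 = READ_ONCE(*x);
> }
>
> exists
> (1:r1=1 /\ 2:r2=1 /\ 2:r3=0)
>
> If P1 sees P0's release and P2 sees P1's, then P2 must also
> see P0's store to x.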
>
> Thoughts?
>
> 5. SB
>
> a. smp_mb(), as in lockless wait-wakeup coordination.
> And as in sys_membarrier()-scheduler coordination,
> for that matter.
>
> Examples seem to be lacking. Most cases use locking.
> Here is one rather strange one from RCU:
>
> void call_rcu_tasks(struct rcu_head *rhp, rcu_callback_t func)
> {
>         unsigned long flags;
>         bool needwake;
>         bool havetask = READ_ONCE(rcu_tasks_kthread_ptr);
>
>         rhp->next = NULL;
>         rhp->func = func;
>         raw_spin_lock_irqsave(&rcu_tasks_cbs_lock, flags);
>         needwake = !rcu_tasks_cbs_head;
>         *rcu_tasks_cbs_tail = rhp;
>         rcu_tasks_cbs_tail = &rhp->next;
>         raw_spin_unlock_irqrestore(&rcu_tasks_cbs_lock, flags);
>         /* We can't create the thread unless interrupts are enabled. */
>         if ((needwake && havetask) ||
>             (!havetask && !irqs_disabled_flags(flags))) {
>                 rcu_spawn_tasks_kthread();
>                 wake_up(&rcu_tasks_cbs_wq);
>         }
> }
>
> And for the wait side, using synchronize_sched() to supply
> the barrier for both ends, with the preemption disabling
> due to raw_spin_lock_irqsave() serving as the read-side
> critical section:
>
>         if (!list) {
>                 wait_event_interruptible(rcu_tasks_cbs_wq,
>                                          rcu_tasks_cbs_head);
>                 if (!rcu_tasks_cbs_head) {
>                         WARN_ON(signal_pending(current));
>                         schedule_timeout_interruptible(HZ/10);
>                 }
>                 continue;
>         }
>         synchronize_sched();
>
> -----------------
>
> Here is another one that uses atomic_cmpxchg() as a
> full memory barrier:
>
>         if (!wait_event_timeout(*wait, !atomic_read(stopping),
>                                 msecs_to_jiffies(1000))) {
>                 atomic_set(stopping, 0);
>                 smp_mb();
>                 return -ETIMEDOUT;
>         }
>
> int omap3isp_module_sync_is_stopping(wait_queue_head_t *wait,
>                                      atomic_t *stopping)
> {
>         if (atomic_cmpxchg(stopping, 1, 0)) {
>                 wake_up(wait);
>                 return 1;
>         }
>
>         return 0;
> }
>
> -----------------
>
> And here is the generic pattern for the above two examples
> taken from waitqueue_active() in include/linux/wait.h:
>
> *      CPU0 - waker                    CPU1 - waiter
> *
> *                                      for (;;) {
> *      @cond = true;                     prepare_to_wait(&wq_head, &wait, state);
> *      smp_mb();                         // smp_mb() from set_current_state()
> *      if (waitqueue_active(wq_head))    if (@cond)
> *        wake_up(wq_head);                 break;
> *                                        schedule();
> *                                      }
> *      finish_wait(&wq_head, &wait);
>
> Note that prepare_to_wait() does both the write
> and the set_current_state() that contains the smp_mb().
> The corresponding reads are the waitqueue_active() on the
> waker side and the "if (@cond)" on the waiter side.
>
> 6. W+RWC+porel+mb+mb
>
> See recipes-LKcode-63cae12bce986.txt.
>
> Mostly of historical interest -- as far as I know, this commit
> was the first to contain a litmus test.
>
> 7. Context switch and migration. A bit specialized, so might leave
> this one out.
>
> When a thread moves from one CPU to another (and possibly on to yet another), the
> scheduler is required to do whatever is necessary for the thread
> to see any prior accesses that it executed on other CPUs. This
> includes "interesting" interactions with wake_up() shown in the
> following comment from try_to_wake_up() in kernel/sched/core.c:
>
> * Notes on Program-Order guarantees on SMP systems.
> *
> * MIGRATION
> *
> * The basic program-order guarantee on SMP systems is that when a task [t]
> * migrates, all its activity on its old CPU [c0] happens-before any subsequent
> * execution on its new CPU [c1].
> *
> * For migration (of runnable tasks) this is provided by the following means:
> *
> * A) UNLOCK of the rq(c0)->lock scheduling out task t
> * B) migration for t is required to synchronize *both* rq(c0)->lock and
> * rq(c1)->lock (if not at the same time, then in that order).
> * C) LOCK of the rq(c1)->lock scheduling in task
> *
> * Transitivity guarantees that B happens after A and C after B.
> * Note: we only require RCpc transitivity.
> * Note: the CPU doing B need not be c0 or c1
> *
> * Example:
> *
> *   CPU0            CPU1            CPU2
> *
> *   LOCK rq(0)->lock
> *   sched-out X
> *   sched-in Y
> *   UNLOCK rq(0)->lock
> *
> *                                   LOCK rq(0)->lock // orders against CPU0
> *                                   dequeue X
> *                                   UNLOCK rq(0)->lock
> *
> *                                   LOCK rq(1)->lock
> *                                   enqueue X
> *                                   UNLOCK rq(1)->lock
> *
> *                   LOCK rq(1)->lock // orders against CPU2
> *                   sched-out Z
> *                   sched-in X
> *                   UNLOCK rq(1)->lock

We might want to "specialize" this snippet to emphasize (or illustrate)
a particular _cycle_; one possible way of "closing the chain" would be:

CPU0

smp_store_release(&X->on_cpu, 0); /* from sched-out X */
UNLOCK rq(0)->lock


CPU2

LOCK rq(0)->lock
UNLOCK rq(1)->lock


CPU1

LOCK rq(1)->lock
X->on_cpu = 1; /* from sched-in X */


BUG_ON("each LOCK reads from its UNLOCK" && X->on_cpu == 0);

(cf. Z6.2 in test6.pdf)


> *
> *
> * BLOCKING -- aka. SLEEP + WAKEUP
> *
> * For blocking we (obviously) need to provide the same guarantee as for
> * migration. However the means are completely different as there is no lock
> * chain to provide order. Instead we do:
> *
> * 1) smp_store_release(X->on_cpu, 0)
> * 2) smp_cond_load_acquire(!X->on_cpu)
> *
> * Example:
> *
> *   CPU0 (schedule)  CPU1 (try_to_wake_up) CPU2 (schedule)
> *
> *   LOCK rq(0)->lock LOCK X->pi_lock
> *   dequeue X
> *   sched-out X
> *   smp_store_release(X->on_cpu, 0);
> *
> *                    smp_cond_load_acquire(&X->on_cpu, !VAL);
> *                    X->state = WAKING
> *                    set_task_cpu(X,2)
> *
> *                    LOCK rq(2)->lock
> *                    enqueue X
> *                    X->state = RUNNING
> *                    UNLOCK rq(2)->lock
> *
> *                                          LOCK rq(2)->lock // orders against CPU1
> *                                          sched-out Z
> *                                          sched-in X
> *                                          UNLOCK rq(2)->lock
> *
> *                    UNLOCK X->pi_lock
> *   UNLOCK rq(0)->lock

Similarly:

CPU0

smp_store_release(&X->on_cpu, 0); /* from sched-out X */


CPU1

smp_cond_load_acquire(&X->on_cpu, !VAL);
UNLOCK rq(2)->lock


CPU2

LOCK rq(2)->lock
X->on_cpu = 1; // from sched-in X


BUG_ON("each LOCK reads from its UNLOCK" && X->on_cpu == 0);

(cf. WWC in test6.pdf)

Andrea


> *
> *
> * However, for wakeups there is a second guarantee we must provide, namely we
> * must observe the state that led to our wakeup. That is, not only must our
> * task observe its own prior state, it must also observe the stores prior to
> * its wakeup.
> *
> * This means that any means of doing remote wakeups must order the CPU doing
> * the wakeup against the CPU the task is going to end up running on. This,
> * however, is already required for the regular Program-Order guarantee above,
> * since the waking CPU is the one issuing the ACQUIRE (smp_cond_load_acquire).

> commit 63cae12bce9861cec309798d34701cf3da20bc71
> Author: Peter Zijlstra <peterz@xxxxxxxxxxxxx>
> Date: Fri Dec 9 14:59:00 2016 +0100
>
> perf/core: Fix sys_perf_event_open() vs. hotplug
>
> There is a problem with installing an event in a task that is 'stuck' on
> an offline CPU.
>
> Blocked tasks are not dis-associated from offlined CPUs; after all, a
> blocked task doesn't run and doesn't require a CPU, etc. Only on
> wakeup do we amend the situation and place the task on an available
> CPU.
>
> If we hit such a task with perf_install_in_context() we'll loop until
> either that task wakes up or the CPU comes back online, if the task
> waking depends on the event being installed, we're stuck.
>
> While looking into this issue, I also spotted another problem, if we
> hit a task with perf_install_in_context() that is in the middle of
> being migrated, that is we observe the old CPU before sending the IPI,
> but run the IPI (on the old CPU) while the task is already running on
> the new CPU, things also go sideways.
>
> Rework things to rely on task_curr() -- outside of rq->lock -- which
> is rather tricky. Imagine the following scenario where we're trying to
> install the first event into our task 't':
>
> CPU0            CPU1            CPU2
>
>                 (current == t)
>
> t->perf_event_ctxp[] = ctx;
> smp_mb();
> cpu = task_cpu(t);
>
>                 switch(t, n);
>                                 migrate(t, 2);
>                                 switch(p, t);
>
>                                 ctx = t->perf_event_ctxp[]; // must not be NULL
>
> smp_function_call(cpu, ..);
>
>                 generic_exec_single()
>                   func();
>                     spin_lock(ctx->lock);
>                     if (task_curr(t)) // false
>
>                     add_event_to_ctx();
>                     spin_unlock(ctx->lock);
>
>                                 perf_event_context_sched_in();
>                                   spin_lock(ctx->lock);
>                                   // sees event
>
> So it's CPU0's store of t->perf_event_ctxp[] that must not go 'missing'.
> Because if CPU2's load of that variable were to observe NULL, it would
> not try to schedule the ctx and we'd have a task running without its
> counter, which would be 'bad'.
>
> As long as we observe !NULL, we'll acquire ctx->lock. If we acquire it
> first and not see the event yet, then CPU0 must observe task_curr()
> and retry. If the install happens first, then we must see the event on
> sched-in and all is well.
>
> I think we can translate the first part (until the 'must not be NULL')
> of the scenario to a litmus test like:
>
> C C-peterz
>
> {
> }
>
> P0(int *x, int *y)
> {
>         int r1;
>
>         WRITE_ONCE(*x, 1);
>         smp_mb();
>         r1 = READ_ONCE(*y);
> }
>
> P1(int *y, int *z)
> {
>         WRITE_ONCE(*y, 1);
>         smp_store_release(z, 1);
> }
>
> P2(int *x, int *z)
> {
>         int r1;
>         int r2;
>
>         r1 = smp_load_acquire(z);
>         smp_mb();
>         r2 = READ_ONCE(*x);
> }
>
> exists
> (0:r1=0 /\ 2:r1=1 /\ 2:r2=0)
>
> Where:
> x is perf_event_ctxp[],
> y is our task's CPU, and
> z is our task being placed on the rq of CPU2.
>
> The P0 smp_mb() is the one added by this patch, ordering the store to
> perf_event_ctxp[] from find_get_context() and the load of task_cpu()
> in task_function_call().
>
> The smp_store_release/smp_load_acquire model the RCpc locking of the
> rq->lock and the smp_mb() of P2 is the context switch switching from
> whatever CPU2 was running to our task 't'.
>
> This litmus test evaluates into:
>
> Test C-peterz Allowed
> States 7
> 0:r1=0; 2:r1=0; 2:r2=0;
> 0:r1=0; 2:r1=0; 2:r2=1;
> 0:r1=0; 2:r1=1; 2:r2=1;
> 0:r1=1; 2:r1=0; 2:r2=0;
> 0:r1=1; 2:r1=0; 2:r2=1;
> 0:r1=1; 2:r1=1; 2:r2=0;
> 0:r1=1; 2:r1=1; 2:r2=1;
> No
> Witnesses
> Positive: 0 Negative: 7
> Condition exists (0:r1=0 /\ 2:r1=1 /\ 2:r2=0)
> Observation C-peterz Never 0 7
> Hash=e427f41d9146b2a5445101d3e2fcaa34
>
> And the strong and weak model agree.
>
> Reported-by: Mark Rutland <mark.rutland@xxxxxxx>
> Tested-by: Mark Rutland <mark.rutland@xxxxxxx>
> Signed-off-by: Peter Zijlstra (Intel) <peterz@xxxxxxxxxxxxx>
> Cc: Alexander Shishkin <alexander.shishkin@xxxxxxxxxxxxxxx>
> Cc: Arnaldo Carvalho de Melo <acme@xxxxxxxxxx>
> Cc: Arnaldo Carvalho de Melo <acme@xxxxxxxxxx>
> Cc: Jiri Olsa <jolsa@xxxxxxxxxx>
> Cc: Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx>
> Cc: Peter Zijlstra <peterz@xxxxxxxxxxxxx>
> Cc: Sebastian Andrzej Siewior <bigeasy@xxxxxxxxxxxxx>
> Cc: Stephane Eranian <eranian@xxxxxxxxxx>
> Cc: Thomas Gleixner <tglx@xxxxxxxxxxxxx>
> Cc: Vince Weaver <vincent.weaver@xxxxxxxxx>
> Cc: Will Deacon <will.deacon@xxxxxxx>
> Cc: jeremy.linton@xxxxxxx
> Link: http://lkml.kernel.org/r/20161209135900.GU3174@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
> Signed-off-by: Ingo Molnar <mingo@xxxxxxxxxx>
>
> diff --git a/kernel/events/core.c b/kernel/events/core.c
> index ab15509fab8c..72ce7d63e561 100644
> --- a/kernel/events/core.c
> +++ b/kernel/events/core.c
> @@ -2249,7 +2249,7 @@ static int __perf_install_in_context(void *info)
>          struct perf_event_context *ctx = event->ctx;
>          struct perf_cpu_context *cpuctx = __get_cpu_context(ctx);
>          struct perf_event_context *task_ctx = cpuctx->task_ctx;
> -        bool activate = true;
> +        bool reprogram = true;
>          int ret = 0;
>  
>          raw_spin_lock(&cpuctx->ctx.lock);
> @@ -2257,27 +2257,26 @@ static int __perf_install_in_context(void *info)
>                  raw_spin_lock(&ctx->lock);
>                  task_ctx = ctx;
>  
> -                /* If we're on the wrong CPU, try again */
> -                if (task_cpu(ctx->task) != smp_processor_id()) {
> -                        ret = -ESRCH;
> -                        goto unlock;
> -                }
> +                reprogram = (ctx->task == current);
>  
>                  /*
> -                 * If we're on the right CPU, see if the task we target is
> -                 * current, if not we don't have to activate the ctx, a future
> -                 * context switch will do that for us.
> +                 * If the task is running, it must be running on this CPU,
> +                 * otherwise we cannot reprogram things.
> +                 *
> +                 * If its not running, we don't care, ctx->lock will
> +                 * serialize against it becoming runnable.
>                   */
> -                if (ctx->task != current)
> -                        activate = false;
> -                else
> -                        WARN_ON_ONCE(cpuctx->task_ctx && cpuctx->task_ctx != ctx);
> +                if (task_curr(ctx->task) && !reprogram) {
> +                        ret = -ESRCH;
> +                        goto unlock;
> +                }
>  
> +                WARN_ON_ONCE(reprogram && cpuctx->task_ctx && cpuctx->task_ctx != ctx);
>          } else if (task_ctx) {
>                  raw_spin_lock(&task_ctx->lock);
>          }
>  
> -        if (activate) {
> +        if (reprogram) {
>                  ctx_sched_out(ctx, cpuctx, EVENT_TIME);
>                  add_event_to_ctx(event, ctx);
>                  ctx_resched(cpuctx, task_ctx);
> @@ -2328,13 +2327,36 @@ perf_install_in_context(struct perf_event_context *ctx,
>          /*
>           * Installing events is tricky because we cannot rely on ctx->is_active
>           * to be set in case this is the nr_events 0 -> 1 transition.
> +         *
> +         * Instead we use task_curr(), which tells us if the task is running.
> +         * However, since we use task_curr() outside of rq::lock, we can race
> +         * against the actual state. This means the result can be wrong.
> +         *
> +         * If we get a false positive, we retry, this is harmless.
> +         *
> +         * If we get a false negative, things are complicated. If we are after
> +         * perf_event_context_sched_in() ctx::lock will serialize us, and the
> +         * value must be correct. If we're before, it doesn't matter since
> +         * perf_event_context_sched_in() will program the counter.
> +         *
> +         * However, this hinges on the remote context switch having observed
> +         * our task->perf_event_ctxp[] store, such that it will in fact take
> +         * ctx::lock in perf_event_context_sched_in().
> +         *
> +         * We do this by task_function_call(), if the IPI fails to hit the task
> +         * we know any future context switch of task must see the
> +         * perf_event_ctpx[] store.
>           */
> -again:
> +
>          /*
> -         * Cannot use task_function_call() because we need to run on the task's
> -         * CPU regardless of whether its current or not.
> +         * This smp_mb() orders the task->perf_event_ctxp[] store with the
> +         * task_cpu() load, such that if the IPI then does not find the task
> +         * running, a future context switch of that task must observe the
> +         * store.
>           */
> -        if (!cpu_function_call(task_cpu(task), __perf_install_in_context, event))
> +        smp_mb();
> +again:
> +        if (!task_function_call(task, __perf_install_in_context, event))
>                  return;
>  
>          raw_spin_lock_irq(&ctx->lock);
> @@ -2348,12 +2370,16 @@ again:
>                  raw_spin_unlock_irq(&ctx->lock);
>                  return;
>          }
> -        raw_spin_unlock_irq(&ctx->lock);
>          /*
> -         * Since !ctx->is_active doesn't mean anything, we must IPI
> -         * unconditionally.
> +         * If the task is not running, ctx->lock will avoid it becoming so,
> +         * thus we can safely install the event.
>           */
> -        goto again;
> +        if (task_curr(task)) {
> +                raw_spin_unlock_irq(&ctx->lock);
> +                goto again;
> +        }
> +        add_event_to_ctx(event, ctx);
> +        raw_spin_unlock_irq(&ctx->lock);
>  }
>  
>  /*