Re: Updated sys_membarrier() speedup patch, FYI

From: Paul E. McKenney
Date: Thu Jul 27 2017 - 15:06:45 EST


On Thu, Jul 27, 2017 at 11:36:38AM -0700, Andrew Hunter wrote:
> On Thu, Jul 27, 2017 at 11:12 AM, Paul E. McKenney
> <paulmck@xxxxxxxxxxxxxxxxxx> wrote:
> > Hello!
> > But my main question is whether the throttling shown below is acceptable
> > for your use cases, namely only one expedited sys_membarrier() permitted
> > per scheduling-clock period (1 millisecond on many platforms), with any
> > excess being silently converted to non-expedited form.
>
> Google doesn't use sys_membarrier (that I know of...), but we do use
> RSEQ fences, which implements membarrier + a little extra to interrupt
> RSEQ critical sections (via IPI--smp_call_function_many.) One
> important optimization here is that we only throw IPIs to cpus running
> the same mm as current (or a subset if requested by userspace), as
> this is sufficient for the API guarantees we provide. I suspect a
> similar optimization would largely mitigate DOS concerns, no? I don't
> know if there are use cases not covered. To answer your question:
> throttling these (or our equivalents) would be fine in terms of
> userspace throughput. We haven't noticed performance problems
> requiring such an intervention, however.

IPIing only those CPUs running threads in the same process as the
thread invoking membarrier() would be very nice! There is some LKML
discussion on this topic, which is currently circling around making this
determination reliable on all CPU families. ARM and x86 are thought
to be OK, PowerPC is thought to require a smallish patch, MIPS is
a big question mark, and so on.
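
For illustration only, the target selection described above might be
sketched in userspace C as follows. This is not kernel code: the
cpu_state_sketch structure, NR_CPUS_SKETCH, and ipi_targets_sketch()
are hypothetical names, and the sketch deliberately ignores the hard
part under discussion, namely reading another CPU's current mm reliably
while that CPU may be context-switching.

```c
#include <stdint.h>

#define NR_CPUS_SKETCH 8	/* hypothetical CPU count for the sketch */

/*
 * Hypothetical stand-in for per-CPU state: the address space (mm) that
 * each CPU is currently running, with 0 meaning "no user mm".
 */
struct cpu_state_sketch {
	uintptr_t cur_mm;
};

/*
 * Return a bitmask of the CPUs that would be sent an IPI: every CPU
 * other than the caller's whose current mm matches the caller's mm.
 * A real implementation must also order this scan against context
 * switches, which this sketch does not attempt.
 */
uint32_t ipi_targets_sketch(const struct cpu_state_sketch cpus[NR_CPUS_SKETCH],
			    int self_cpu, uintptr_t mm)
{
	uint32_t mask = 0;

	for (int cpu = 0; cpu < NR_CPUS_SKETCH; cpu++) {
		if (cpu == self_cpu)
			continue;	/* the caller orders itself locally */
		if (cpus[cpu].cur_mm == mm)
			mask |= (uint32_t)1 << cpu;
	}
	return mask;
}
```

CPUs running other processes never appear in the mask, which is why
this approach would limit how much one process can disturb the rest of
the system.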

Good to hear that the throttling would be OK for your workloads,
thank you!

> Furthermore: I wince a bit at the silent downgrade; I'd almost prefer
> -EAGAIN or -EBUSY. In particular, again for RSEQ fence, the downgrade
> simply wouldn't work; rcu_sched_qs() gets called at many points that
> aren't sufficiently quiescent for RSEQ (in particular, when userspace
> code is running!) This is solvable, but worth thinking about.

Good point! One approach would be to unconditionally return -EAGAIN/-EBUSY,
and another would be to have a separate cmd or flag to say what to do
if expedited isn't currently available. My thought would be to
add a separate expedited command, so that one command does the fallback
and the other returns the error.
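
A minimal sketch of that two-command idea, combined with the
one-expedited-request-per-tick throttle, might look like the following.
All names here (mb_cmd_sketch, mb_throttle_sketch, mb_request_sketch(),
and the two command values) are hypothetical and are not part of any
existing membarrier() interface. The choice of -EAGAIN over -EBUSY is
likewise only an assumption.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical commands: one falls back silently, one reports failure. */
enum mb_cmd_sketch { MB_EXPEDITED_FALLBACK, MB_EXPEDITED_STRICT };

/* Which path the request ended up taking. */
enum mb_path_sketch { MB_PATH_EXPEDITED, MB_PATH_NORMAL };

/* Throttle state: the tick in which the last expedited request ran. */
struct mb_throttle_sketch {
	bool used;		/* has any expedited request run yet? */
	uint64_t last_tick;	/* scheduling-clock tick of that request */
};

/*
 * Allow at most one expedited request per scheduling-clock tick.
 * Excess requests either fall back to the non-expedited path or
 * fail with -EAGAIN, depending on the command.  On failure, *path
 * is left unmodified.
 */
int mb_request_sketch(struct mb_throttle_sketch *t, uint64_t now_tick,
		      enum mb_cmd_sketch cmd, enum mb_path_sketch *path)
{
	if (!t->used || t->last_tick != now_tick) {
		t->used = true;
		t->last_tick = now_tick;
		*path = MB_PATH_EXPEDITED;
		return 0;
	}
	if (cmd == MB_EXPEDITED_STRICT)
		return -EAGAIN;	/* caller decides whether to retry */
	*path = MB_PATH_NORMAL;	/* silent downgrade */
	return 0;
}
```

With this split, a caller for which the downgrade is not acceptable
would use the strict command and handle the error itself.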

But I am surprised when you say that the downgrade would not work, at
least if you are not running with nohz_full CPUs. The rcu_sched_qs()
function simply sets a per-CPU quiescent-state flag. The needed strong
ordering is instead supplied by the combination of the code starting
the grace period, reporting the setting of the quiescent-state flag
to core RCU, and the code completing the grace period. Each non-idle
CPU will execute full memory barriers either in RCU_SOFTIRQ context,
on entry to idle, on exit from idle, or within the grace-period kthread.
In particular, a CPU running the same usermode thread for the entire
grace period will execute the needed memory barriers in RCU_SOFTIRQ
context shortly after taking a scheduling-clock interrupt.

So are you running nohz_full CPUs? Or is there something else that I
am missing?

Thanx, Paul