Re: [PATCH tip/core/rcu 4/5] sys_membarrier: Add expedited option

From: Paul E. McKenney
Date: Tue Jul 25 2017 - 17:19:39 EST


On Tue, Jul 25, 2017 at 10:24:51PM +0200, Peter Zijlstra wrote:
> On Tue, Jul 25, 2017 at 12:36:12PM -0700, Paul E. McKenney wrote:
>
> > There are a lot of variations, to be sure. For whatever it is worth,
> > the original patch that started this uses mprotect():
> >
> > https://github.com/msullivan/userspace-rcu/commit/04656b468d418efbc5d934ab07954eb8395a7ab0
>
> FWIW that will not work on s390 (and maybe others), they don't in fact
> require IPIs for remote TLB invalidation.

Nor will it for ARM. Nor (I think) for PowerPC. But that is in fact
what people are doing right now in real life. Hence my renewed interest
in sys_membarrier().

> > > Which are those? I thought we significantly reduced those with the nohz
> > > full work. Most IPI uses now first check if a CPU actually needs the IPI
> > > before sending it IIRC.
> >
> > If the task being awakened is higher priority than the task currently
> > running on a given CPU, that CPU still gets an IPI, right? Or am I
> > completely confused?
>
> I was thinking of things like on_each_cpu_cond().

Fair enough.

But it would not be hard for userspace code to force IPIs by repeatedly
awakening higher-priority threads that sleep immediately after being
awakened, right?

> > Unlike userspace preempt disable, in this case we get the abuse anyway
> > via existing mechanisms, as in they are already being abused. If we
> > provide a mechanism for this purpose, we at least have the potential
> > for handling the abuse, for example:
> >
> > o "Defanging" sys_membarrier() on systems that are sensitive to
> > latency. For example, this patch can be defanged by booting
> > with the rcupdate.rcu_normal=1 kernel boot parameter, which
> > causes requests for expedited grace periods to instead use
> > normal grace periods.
> >
> > o Detecting and responding to abuse. For example, perhaps if there
> > are more than (say) 50 expedited sys_membarrier()s within a given
> > jiffy, the excess sys_membarrier()s are non-expedited.
> >
> > o Batching optimizations allow large number of concurrent requests
> > to be handled with fewer grace periods -- and both normal and
> > expedited grace periods already do exactly this.
> >
> > This horse is already out, so trying to shut the gate won't be effective.
>
> Well, I'm not sure there is an easy means of doing machine wide IPIs for
> !root out there. This would be a first.
>
> Something along the lines of:
>
> void dummy(void *arg)
> {
> /* IPIs are assumed to be serializing */
> }
>
> void ipi_mm(struct mm_struct *mm)
> {
> cpumask_var_t cpus;
> int cpu;
>
> zalloc_cpumask_var(&cpus, GFP_KERNEL);
>
> for_each_cpu(cpu, mm_cpumask(mm)) {
> struct task_struct *p;
>
> /*
> * If the current task of @cpu isn't of this @mm, then
> * it needs a context switch to become one, which will
> * provide the ordering we require.
> */
> rcu_read_lock();
> p = task_rcu_dereference(&cpu_curr(cpu));
> if (p && p->mm == mm)
> __cpumask_set_cpu(cpu, cpus);
> rcu_read_unlock();
> }
>
> on_each_cpu_mask(cpus, dummy, NULL, 1);
> }
>
> Would appear to be minimally invasive and only shoot at CPUs we're
> currently running our process on, which greatly reduces the impact.

I am good with this approach as well, and I do very much like that it
avoids IPIing CPUs that aren't running our process (at least in the
common case). But don't we also need added memory ordering? It is
sort of OK to IPI a CPU that just now switched away from our process,
but not so good to miss IPIing a CPU that switched to our process just
a little before sys_membarrier().

I was intending to base this on the last few versions of a 2010 patch,
but maybe things have changed:

https://marc.info/?l=linux-kernel&m=126358017229620&w=2
https://marc.info/?l=linux-kernel&m=126436996014016&w=2
https://marc.info/?l=linux-kernel&m=126601479802978&w=2
https://marc.info/?l=linux-kernel&m=126970692903302&w=2

Discussion here:

https://marc.info/?l=linux-kernel&m=126349766324224&w=2

The discussion led to acquiring the runqueue locks, as there was
otherwise a need to add code to the scheduler fastpaths.

There was a desire to make this work automatically among multiple
processes sharing some memory, but I believe that in this case
the user is going to have to track the multiple processes anyway,
and so can simply do sys_membarrier from each:

https://marc.info/?l=linux-arch&m=126686393820832&w=2

Some architectures are less precise than others in tracking which
CPUs are running a given process due to ASIDs, though this is
thought to be a non-problem:

https://marc.info/?l=linux-arch&m=126716090413065&w=2
https://marc.info/?l=linux-arch&m=126716262815202&w=2

Thoughts?

Thanx, Paul