Re: [RFC PATCH v2] membarrier: expedited private command

From: Nicholas Piggin
Date: Mon Jul 31 2017 - 22:01:07 EST

Next message: Jiaheng Fan: "RE: [PATCH v1 1/6] dma: fsl-dma: add devicetree documentation for qdma driver."
Previous message: Mark yao: "Re: [PATCH 6/6] drm/rockchip: fix race with kms hotplug and fbdev"
In reply to: Mathieu Desnoyers: "Re: [RFC PATCH v2] membarrier: expedited private command"
Next in thread: Mathieu Desnoyers: "Re: [RFC PATCH v2] membarrier: expedited private command"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

On Tue, 1 Aug 2017 01:33:09 +0000 (UTC)
Mathieu Desnoyers <mathieu.desnoyers@xxxxxxxxxxxx> wrote:

> ----- On Jul 31, 2017, at 8:35 PM, Nicholas Piggin npiggin@xxxxxxxxx wrote:
>
> > On Mon, 31 Jul 2017 23:20:59 +1000
> > Michael Ellerman <mpe@xxxxxxxxxxxxxx> wrote:
> >
> >> Peter Zijlstra <peterz@xxxxxxxxxxxxx> writes:
> >>
> >> > On Fri, Jul 28, 2017 at 10:55:32AM +0200, Peter Zijlstra wrote:
> >> >> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> >> >> index e9785f7aed75..33f34a201255 100644
> >> >> --- a/kernel/sched/core.c
> >> >> +++ b/kernel/sched/core.c
> >> >> @@ -2641,8 +2641,18 @@ static struct rq *finish_task_switch(struct task_struct
> >> >> *prev)
> >> >> finish_arch_post_lock_switch();
> >> >>
> >> >> fire_sched_in_preempt_notifiers(current);
> >> >> +
> >> >> + /*
> >> >> + * For CONFIG_MEMBARRIER we need a full memory barrier after the
> >> >> + * rq->curr assignment. Not all architectures have one in either
> >> >> + * switch_to() or switch_mm() so we use (and complement) the one
> >> >> + * implied by mmdrop()'s atomic_dec_and_test().
> >> >> + */
> >> >> if (mm)
> >> >> mmdrop(mm);
> >> >> + else if (IS_ENABLED(CONFIG_MEMBARRIER))
> >> >> + smp_mb();
> >> >> +
> >> >> if (unlikely(prev_state == TASK_DEAD)) {
> >> >> if (prev->sched_class->task_dead)
> >> >> prev->sched_class->task_dead(prev);
> >> >>
> >> >>
> >> >
> >> >> a whole bunch of architectures don't in fact need this extra barrier at all.
> >> >
> >> > In fact, I'm fairly sure its only PPC.
> >> >
> >> > Because only ARM64 and PPC actually implement ACQUIRE/RELEASE with
> >> > anything other than smp_mb() (for now, Risc-V is in this same boat and
> >> > MIPS could be if they ever sort out their fancy barriers).
> >> >
> >> > TSO archs use a regular STORE for RELEASE, but all their atomics imply a
> >> > smp_mb() and there are enough around to make one happen (typically
> >> > mm_cpumask updates).
> >> >
> >> > Everybody else, aside from ARM64 and PPC must use smp_mb() for
> >> > ACQUIRE/RELEASE.
> >> >
> >> > ARM64 has a super duper barrier in switch_to().
> >> >
> >> > Which only leaves PPC stranded.. but the 'good' news is that mpe says
> >> > they'll probably need a barrier in switch_mm() in any case.
> >>
> >> I may have been sleep deprived. We have a patch, probably soon to be
> >> merged, which will add a smp_mb() in switch_mm() but *only* when we add
> >> a CPU to mm_cpumask, ie. when we run on a CPU we haven't run on before.
> >>
> >> I'm not across membarrier enough to know if that's sufficient, but it
> >> seems unlikely?
> >
> > Won't be sufficient, they need a barrier after assigning rq->curr.
> > It can be avoided when switching between threads with the same mm.
> >
> > I would like to see how bad membarrier performance is if we made
> > that side heavy enough to avoid the barrier in context switch (e.g.,
> > by taking the rq locks, or using synchronize_sched_expedited -- on
> > a per-arch basis of course).
> >
> > Is there some (realistic-ish) benchmark using membarrier we can
> > experiment with?
>
> Hi Nick,
>
> I have benchmark programs using membarrier in liburcu [1]. They
> represent very heavy usage of membarrier (perhaps more extreme than
> reality).
>
> The master branch is at:
>
> git://git.liburcu.org/userspace-rcu.git
>
> Below is a test branch I quickly hacked to make urcu use membarrier
> MEMBARRIER_CMD_PRIVATE_EXPEDITED:
>
> https://github.com/compudj/userspace-rcu-dev/tree/test-mb-private
>
> An example benchmark is under tests/benchmark
>
> Usage: test_urcu nr_readers nr_writers duration (s) <OPTIONS>
>
> Here is an example on my 16-core Intel machine (with hyperthreading, so don't
> expect this benchmark to be very reliable) with 4 reader threads, 1 writer
> thread. The writer thread in this benchmark issues a synchronize_rcu() at
> each iteration, so typically 2 sys_membarrier calls. I'm compiling with
> the configure flag "--disable-sys-membarrier-fallback" to have the full
> read-side speed.
>
> Using MEMBARRIER_CMD_SHARED:
>
> $ ./test_urcu 4 1 10
> SUMMARY [...] test_urcu testdur 10 nr_readers 4 rdur 0 wdur 0 nr_writers 1 wdelay 0 nr_reads 17522991852 nr_writes 249 nr_ops 17522992101
>
> 438'074'796 reads / reader thread / s
> 25 updates / s
>
> Using MEMBARRIER_CMD_PRIVATE_EXPEDITED:
>
> $ ./test_urcu 4 1 10
> SUMMARY [...] testdur 10 nr_readers 4 rdur 0 wdur 0 nr_writers 1 wdelay 0 nr_reads 12709653290 nr_writes 799668 nr_ops 12710452958
>
> 317'741'332 reads / reader thread / s
> 79'966 updates / s
>
> As a comparison, refer to the test_urcu_mb benchmark, which uses
> a full memory barrier in rcu_read_lock and rcu_read_unlock, and
> can therefore do without any sys_membarrier. This is our fallback
> when sys_membarrier returns -ENOSYS.
>
> $ ./test_urcu_mb 4 1 10
> SUMMARY [...] test_urcu_mb testdur 10 nr_readers 4 rdur 0 wdur 0 nr_writers 1 wdelay 0 nr_reads 137290634 nr_writes 8624184 nr_ops 145914818
>
> 3'432'265 reads / reader thread / s
> 862'418 updates / s
>
> You could try adapting my membarrier private-expedited patch to grab the
> rq lock for each CPU to see the impact on the benchmark for your specific
> architecture. I suspect that the overhead that will matter in that case
> is not only the speed of the membarrier operation, but also its disruption
> of the scheduler rq lock cachelines, which will have effect on other unrelated
> workloads. You may need to benchmark another scheduler-intensive workload
> executed in parallel with heavy use of sys_membarrier adapted to grab rq locks
> to figure this one out.
>
> Let me know if you have questions,

Thanks for this, I'll take a look. This should be a good start as a stress
test, but I'd also be interested in some application. The reason being that
for example using runqueue locks may give reasonable maximum throughput
numbers, but could cause some latency or slowdown when it's used in more
realistic scenario.

We don't have numbers for smp_mb in switch_mm. As you say that should be a
lower overhead, but probably still undesirable. But that's one of the
things I'm looking at.

Will report back after I've tried a few more things.

Thanks,
Nick

Next message: Jiaheng Fan: "RE: [PATCH v1 1/6] dma: fsl-dma: add devicetree documentation for qdma driver."
Previous message: Mark yao: "Re: [PATCH 6/6] drm/rockchip: fix race with kms hotplug and fbdev"
In reply to: Mathieu Desnoyers: "Re: [RFC PATCH v2] membarrier: expedited private command"
Next in thread: Mathieu Desnoyers: "Re: [RFC PATCH v2] membarrier: expedited private command"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]