Re: [v3,11/41] mips: reuse asm-generic/barrier.h

From: Will Deacon
Date: Mon Jan 25 2016 - 13:03:12 EST


Hi Paul,

On Fri, Jan 15, 2016 at 09:39:12AM -0800, Paul E. McKenney wrote:
> On Fri, Jan 15, 2016 at 09:55:54AM +0100, Peter Zijlstra wrote:
> > On Thu, Jan 14, 2016 at 01:29:13PM -0800, Paul E. McKenney wrote:
> > > So smp_mb() provides transitivity, as do pairs of smp_store_release()
> > > and smp_read_acquire(),
> >
> > But they provide different grades of transitivity, which is where all
> > the confusion lays.
> >
> > smp_mb() is strongly/globally transitive, all CPUs will agree on the order.
> >
> > Whereas the RCpc release+acquire is weakly so, only the two cpus
> > involved in the handover will agree on the order.
>
> Good point!
>
> Using grace periods in place of smp_mb() also provides strong/global
> transitivity, but also insanely high latencies. ;-)
>
> The patch below updates Documentation/memory-barriers.txt to define
> local vs. global transitivity. The corresponding ppcmem litmus test
> is included below as well.
>
> Should we start putting litmus tests for the various examples
> somewhere, perhaps in a litmus-tests directory within each participating
> architecture? I have a pile of powerpc-related litmus tests on my laptop,
> but they probably aren't doing all that much good there.

I too would like to have the litmus tests in the kernel so that we can
refer to them from memory-barriers.txt. Ideally they wouldn't be targetted
to a particular arch, however.

> PPC local-transitive
> ""
> {
> 0:r1=1; 0:r2=u; 0:r3=v; 0:r4=x; 0:r5=y; 0:r6=z;
> 1:r1=1; 1:r2=u; 1:r3=v; 1:r4=x; 1:r5=y; 1:r6=z;
> 2:r1=1; 2:r2=u; 2:r3=v; 2:r4=x; 2:r5=y; 2:r6=z;
> 3:r1=1; 3:r2=u; 3:r3=v; 3:r4=x; 3:r5=y; 3:r6=z;
> }
> P0 | P1 | P2 | P3 ;
> lwz r9,0(r4) | lwz r9,0(r5) | lwz r9,0(r6) | stw r1,0(r3) ;
> lwsync | lwsync | lwsync | sync ;
> stw r1,0(r2) | lwz r8,0(r3) | stw r1,0(r7) | lwz r9,0(r2) ;
> lwsync | lwz r7,0(r2) | | ;
> stw r1,0(r5) | lwsync | | ;
> | stw r1,0(r6) | | ;
> exists
> (* (0:r9=0 /\ 1:r9=1 /\ 2:r9=1 /\ 1:r8=0 /\ 3:r9=0) *)
> (* (0:r9=1 /\ 1:r9=1 /\ 2:r9=1) *)
> (* (0:r9=0 /\ 1:r9=1 /\ 2:r9=1 /\ 1:r7=0) *)
> (0:r9=0 /\ 1:r9=1 /\ 2:r9=1 /\ 1:r7=0)

i.e. we should rewrite this using READ_ONCE/WRITE_ONCE and smp_mb() etc.

> ------------------------------------------------------------------------
>
> commit 2cb4e83a1b5c89c8e39b8a64bd89269d05913e41
> Author: Paul E. McKenney <paulmck@xxxxxxxxxxxxxxxxxx>
> Date: Fri Jan 15 09:30:42 2016 -0800
>
> documentation: Distinguish between local and global transitivity
>
> The introduction of smp_load_acquire() and smp_store_release() had
> the side effect of introducing a weaker notion of transitivity:
> The transitivity of full smp_mb() barriers is global, but that
> of smp_store_release()/smp_load_acquire() chains is local. This
> commit therefore introduces the notion of local transitivity and
> gives an example.
>
> Reported-by: Peter Zijlstra <peterz@xxxxxxxxxxxxx>
> Reported-by: Will Deacon <will.deacon@xxxxxxx>
> Signed-off-by: Paul E. McKenney <paulmck@xxxxxxxxxxxxxxxxxx>
>
> diff --git a/Documentation/memory-barriers.txt b/Documentation/memory-barriers.txt
> index c66ba46d8079..d8109ed99342 100644
> --- a/Documentation/memory-barriers.txt
> +++ b/Documentation/memory-barriers.txt
> @@ -1318,8 +1318,82 @@ or a level of cache, CPU 2 might have early access to CPU 1's writes.
> General barriers are therefore required to ensure that all CPUs agree
> on the combined order of CPU 1's and CPU 2's accesses.
>
> -To reiterate, if your code requires transitivity, use general barriers
> -throughout.
> +General barriers provide "global transitivity", so that all CPUs will
> +agree on the order of operations. In contrast, a chain of release-acquire
> +pairs provides only "local transitivity", so that only those CPUs on
> +the chain are guaranteed to agree on the combined order of the accesses.

Thanks for having a go at this. I tried defining something axiomatically,
but got stuck pretty quickly. In my scheme, I used "data-directed
transitivity" instead of "local transitivity", since the latter seems to
be a bit of a misnomer.

> +For example, switching to C code in deference to Herman Hollerith:
> +
> + int u, v, x, y, z;
> +
> + void cpu0(void)
> + {
> + r0 = smp_load_acquire(&x);
> + WRITE_ONCE(u, 1);
> + smp_store_release(&y, 1);
> + }
> +
> + void cpu1(void)
> + {
> + r1 = smp_load_acquire(&y);
> + r4 = READ_ONCE(v);
> + r5 = READ_ONCE(u);
> + smp_store_release(&z, 1);
> + }
> +
> + void cpu2(void)
> + {
> + r2 = smp_load_acquire(&z);
> + smp_store_release(&x, 1);
> + }
> +
> + void cpu3(void)
> + {
> + WRITE_ONCE(v, 1);
> + smp_mb();
> + r3 = READ_ONCE(u);
> + }
> +
> +Because cpu0(), cpu1(), and cpu2() participate in a local transitive
> +chain of smp_store_release()/smp_load_acquire() pairs, the following
> +outcome is prohibited:
> +
> + r0 == 1 && r1 == 1 && r2 == 1
> +
> +Furthermore, because of the release-acquire relationship between cpu0()
> +and cpu1(), cpu1() must see cpu0()'s writes, so that the following
> +outcome is prohibited:
> +
> + r1 == 1 && r5 == 0
> +
> +However, the transitivity of release-acquire is local to the participating
> +CPUs and does not apply to cpu3(). Therefore, the following outcome
> +is possible:
> +
> + r0 == 0 && r1 == 1 && r2 == 1 && r3 == 0 && r4 == 0

I think you should be completely explicit and include r5 == 1 here, too.

Also -- where would you add the smp_mb__after_release_acquire to fix
(i.e. forbid) this? Immediately after cpu1()'s read of y?

> +Although cpu0(), cpu1(), and cpu2() will see their respective reads and
> +writes in order, CPUs not involved in the release-acquire chain might
> +well disagree on the order. This disagreement stems from the fact that
> +the weak memory-barrier instructions used to implement smp_load_acquire()
> +and smp_store_release() are not required to order prior stores against
> +subsequent loads in all cases. This means that cpu3() can see cpu0()'s
> +store to u as happening -after- cpu1()'s load from v, even though
> +both cpu0() and cpu1() agree that these two operations occurred in the
> +intended order.
> +
> +However, please keep in mind that smp_load_acquire() is not magic.
> +In particular, it simply reads from its argument with ordering. It does
> +-not- ensure that any particular value will be read. Therefore, the
> +following outcome is possible:
> +
> + r0 == 0 && r1 == 0 && r2 == 0 && r5 == 0
> +
> +Note that this outcome can happen even on a mythical sequentially
> +consistent system where nothing is ever reordered.

I'm not sure this last bit is strictly necessary. If somebody thinks that
acquire/release involve some sort of implicit synchronisation, I think
they may have bigger problems with memory-barriers.txt.

Will