Re: [PATCH v2 2/2] tools/memory-model: Make ppo a subrelation of po

From: Alan Stern
Date: Tue Jan 31 2023 - 11:55:24 EST


On Tue, Jan 31, 2023 at 04:33:25PM +0100, Jonas Oberhauser wrote:
>
>
> On 1/31/2023 4:06 PM, Alan Stern wrote:
> > On Tue, Jan 31, 2023 at 02:56:00PM +0100, Jonas Oberhauser wrote:
> > > I have some additional thoughts now. It seems that you could weaken the
> > > operational model by stating that an A-cumulative fence orders propagation
> > > of all *external* stores (in addition to all po-earlier stores) that
> > > propagated to you before the fence is executed.
> > How is that a weakening of the operational model? It's what the
> > operational model says right now.
>
> No, as in the part that you have quoted, it is stated that an A-cumulative
> fence orderes propagation of *all* stores that propagated to you before the
> fence is executed.
> I'm saying you could weaken this to only cover all *external* stores.

Okay, now I understand.

> More precisely, I would change
>
> > For each other CPU C', any store which propagates to C before
> > a release fence is executed (including all po-earlier
> > stores executed on C) is forced to propagate to C' before the
> > store associated with the release fence does.
>
> Into something like
>
>
>      For each other CPU C', any *external* store which propagates to C
> before
>      a release fence is executed as well as any po-earlier
>      store executed on C is forced to propagate to C' before the
>      store associated with the release fence does.
>
> The difference is that po-later stores that happen to propagate to C before
> the release fence is executed would no longer be ordered.
> That should be consistent with the axiomatic model.

I had to check that it wouldn't affect the (xbstar & int) part of vis,
but it looks all right. This seems like a reasonable change.

However, it only fixes part of the problem. Suppose an external write
is read by an instruction po-after the release-store, but the read
executes before the release-store. The operational model would still
say the external write has to obey the propagation ordering, whereas the
formal model doesn't require it.

> > > > P0(int *x, int *y, int *z)
> > > > {
> > > > int r1;
> > > >
> > > > r1 = READ_ONCE(*x);
> > > > smp_store_release(y, 1);
> > > > WRITE_ONCE(*z, 1);
> > > > }
> > > >
> > > > P1(int *x, int *y, int *z)
> > > > {
> > > > int r2;
> > > >
> > > > r2 = READ_ONCE(*z);
> > > > WRITE_ONCE(*x, r2);
> > > > }
> > > >
> > > > P2(int *x, int *y, int *z)
> > > > {
> > > > int r3;
> > > > int r4;
> > > >
> > > > r3 = READ_ONCE(*y);
> > > > smp_rmb();
> > > > r4 = READ_ONCE(*z);
> > > > }
> > > >
> > > > exists (0:r1=1 /\ 2:r3=1 /\ 2:r4=0)
> > > I could imagine that P0 posts both of its stores in a shared store buffer
> > > before reading *x, but marks the release store as "not ready".
> > > Then P1 forwards *z=1 from the store buffer and posts *x=1, which P0 reads,
> > > and subsequently marks its release store as "ready".
> > That isn't how release stores are meant to work. The read of x is
> > supposed to be complete before the release store becomes visible to any
> > other CPU.
>
> Note that the release store isn't observed until it becomes "ready", so it
> is really indistinguishable of whether it had become visible to any other
> CPU.
> Indeed stores that aren't marked "ready" would be ignored during forwarding,
> and not allowed to be pushed to the cache.

Oops, I mixed up a couple of the accesses. Okay, yes, this mechanism
will allow writes that are po-after a release store but execute before
it to evade the propagation restriction.

> The reason this kind of implementation seems less natural to me is that such
> an "not ready" store would need to be pushed back in the buffer (if it is
> the head of the buffer and the cache is ready to take a store), stall the
> later stores, or be aborted until it becomes ready.
> That just seems to create a lot of hassle for no discernible benefit.
> A "not ready" store probably shouldn't be put into a store queue, even if
> the only reason it is not ready is that there are some otherwise unrelated
> reads that haven't completed yet.
>
>
>
> > This is true even in C11.
>
> Arguable... The following pseudo-code litmus test should demonstrate this:
>
> P0 {
>    int r = read_relaxed(&x);
>    store_release(&y,1);
> }
>
>
> P1 {
>    int s = read_relaxed(&y);
>    store_release(&x,1);
> }
>
> In C11, it should be possible to read r==s==1.

True, in C11 releases don't mean anything unless they're paired with
acquires. But if your P1 had been

int s = read_acquire(&y);
write_relaxed(&x, 1);

then r = s = 1 would not be allowed. And presumably the same object
code would be generated for P0 either way, particularly if P1 was in a
separate compilation unit (link-time optimization notwithstanding).

> Btw, how to proceed for your SRCU patch and this one?
> Are you planning to make any changes? I think the version you have is ok if
> you don't think the patch is improved by anything I brought up.

I don't see any need to change the SRCU patch at this point, other than
to improve the attributions.

> Any additional concerns/changes for this patch?

It should give the same data-race diagnostics as the current LKMM. This
probably means the patch will need to punch up the definitions of
*-pre-bounded and *-post-bounded, unless you can think of a better
approach.

Alan