Re: Plain accesses and data races in the Linux Kernel Memory Model

From: Andrea Parri
Date: Tue Jan 15 2019 - 09:25:58 EST


On Mon, Jan 14, 2019 at 02:41:49PM -0500, Alan Stern wrote:
> The patch below is my first attempt at adapting the Linux Kernel
> Memory Model to handle plain accesses (i.e., those which aren't
> specially marked as READ_ONCE, WRITE_ONCE, acquire, release,
> read-modify-write, or lock accesses). This work is based on an
> initial proposal created by Andrea Parri back in December 2017,
> although it has grown a lot since then.
>
> The adaptation involves two main aspects: recognizing the ordering
> induced by plain accesses and detecting data races. They are handled
> separately. In fact, the code for figuring out the ordering assumes
> there are no data races (the idea being that if a data race is
> present then pretty much anything could happen, so there's no point
> worrying about it -- obviously this will have to be changed if we want
> to cover seqlocks).
>
> This is a relativly major change to the model and it will require a
> lot of scrutiny and testing. At the moment, I haven't even tried to
> compare it with the existing model on our library of litmus tests.
>
> The difficulty with incorporating plain accesses in the memory model
> is that the compiler has very few constraints on how it treats plain
> accesses. It can eliminate them, duplicate them, rearrange them,
> merge them, split them up, and goodness knows what else. To make some
> sense of this, I have taken the view that a plain access can exist
> (perhaps multiple times) within a certain bounded region of code.
> Ordering of two accesses X and Y means that we guarantee at least one
> instance of the X access must be executed before any instances of the
> Y access. (This is assuming that neither of the accesses is
> completely eliminated by the compiler; otherwise there is nothing to
> order!)
>
> After adding some simple definitions for the sets of plain and marked
> accesses and for compiler barriers, the patch updates the ppo
> relation. The basic idea here is that ppo can be broken down into
> categories: memory barriers, overwrites, and dependencies (including
> dep-rfi).
>
> Memory barriers always provide ordering (compiler barriers do
> not but they have indirect effects).
>
> Overwriting always provides ordering. This may seem
> surprising in the case where both X and Y are plain writes,
> but in that case the memory model will say that X can be
> eliminated unless there is at least a compiler barrier between
> X and Y, and this barrier will enforce the ordering.
>
> Some dependencies provide ordering and some don't. Going by
> cases:
>
> An address dependency to a read provides ordering when
> the source is a marked read, even when the target is a
> plain read. This is necessary if rcu_dereference() is
> to work correctly; it is tantamount to assuming that
> the compiler never speculates address dependencies.
> However, if the source is a plain read then there is
> no ordering. This is because of Alpha, which does not
> respect address dependencies to reads (on Alpha,
> marked reads include a memory barrier to enforce the
> ordering but plain reads do not).
>
> An address dependency to a write always provides
> ordering. Neither the compiler nor the CPU can
> speculate the address of a write, because a wrong
> guess could generate a data race. (Question: do we
> need to include the case where the source is a plain
> read?)
>
> A data or control dependency to a write provides
> ordering if the target is a marked write. This is
> because the compiler is obliged to translate a marked
> write as a single machine instruction; if it
> speculates such a write there will be no opportunity
> to correct a mistake.
>
> Dep-rfi (i.e., a data or address dependency from a
> read to a write which is then read from on the same
> CPU) provides ordering between the two reads if the
> target is a marked read. This is again because the
> marked read will be translated as a machine-level load
> instruction, and then the CPU will guarantee the
> ordering.
>
> There is a special case (data;rfi) that doesn't
> provide ordering in itself but can contribute to other
> orderings. A data;rfi link corresponds to situations
> where a value is stored in a temporary shared variable
> and then loaded back again. Since the compiler might
> choose to eliminate the temporary, its accesses can't
> be said to be ordered -- but the accesses around it
> might be. As a simple example, consider:
>
> r1 = READ_ONCE(ptr);
> tmp = r1;
> r2 = tmp;
> WRITE_ONCE(*r2, 5);
>
> The plain accesses involving tmp don't have any
> particular ordering requirements, but we do know that
> the READ_ONCE must be ordered before the WRITE_ONCE.
> The chain of relations is:
>
> [marked] ; data ; rfi ; addr ; [marked]
>
> showing that a data;rfi has been inserted into an
> address dependency from a marked read to a marked
> write. In general, any number of data;rfi links can
> be inserted in each of the other kinds of dependencies.
>
> As mentioned above, ordering applies only in situations where the
> compiler has not eliminated either of the accesses. Therefore the
> memory model tries to identify which accesses must be preserved in
> some form by the compiler.
>
> All marked accesses must be preserved.
>
> A plain write is preserved unless it is overwritten by a
> po-later write with no compiler barrier in between. Even
> then, it must be preserved if it is read by a marked read, or
> if it is read by any preserved read on a different CPU.
>
> A plain read is not preserved unless it is the source of a
> dependency to a preserved access. Even then it is not
> preserved if it reads from a plain write on the same CPU with
> no compiler barrier in between.
>
> The hb (happens-before) relation is restricted to apply only to
> preserved accesses. In addition, the rfe and prop parts of hb are
> restricted to apply only to marked accesses.
>
> The last part of the patch checks for possible data races. A
> potential race is defined as two accesses to the same location on
> different CPUs, where at least one access is plain and at least one is
> a write. In order to qualify as an actual data race, a potential race
> has to fail to meet a suitable ordering requirement. This requirement
> applies in two distinct scenarios:
>
> X is a write and either Y reads from X or Y directly
> overwrites X (i.e., Y immediately follows X in the coherence
> order). In this case X must be visible to Y's CPU before Y
> executes.
>
> X is a read and there is an fre link from X to Y. In this
> case X must execute before Y.
>
> (I don't know whether the second scenario is really needed.
> Intuitively, it seems that if X doesn't execute before Y then there
> must be an alternate execution in which X reads from Y, which would
> then fall under the first scenario. But I can't prove this, or even
> express it precisely.)
>
> There are cases not covered by either scenario. For instance, X could
> be a write and Y could indirectly overwrite X, that is, there might be
> a third write W coherence-between X and Y. But in that case the
> concern is whether X races with W or W races with Y, not whether X
> races with Y. Similarly, X could be a read and Y could be a write
> coherence-before the write W which X reads from. Again, the concern
> would be whether X races with W or W races with Y, not whether X races
> with Y.
>
> Above, the phrase "X must be visible to Y's CPU before Y execute"
> means that there are preserved accesses X' and Y' such that:
>
> X' occurs at or after the end of the bounded region where
> X might execute, Y' occurs at or before the start of the
> bounded region where Y might execute, and X' propagates to Y's
> CPU before Y' executes. This last is defined by a new
> relation vis which has a relatively simple definition.
>
> Similarly, "X must execute before Y" means that there are preserved
> accesses X' and Y' such that:
>
> X' occurs at or after the end of the bounded region where
> X might execute, Y' occurs at or before the start of the
> bounded region where Y might execute, and X' is related to
> Y' by xb+ (where xb = hb | pb | rb).
>
> To use this, we have to have bounds for the region where an access
> might execute. For marked accesses the bounds are trivial: A marked
> access can execute only once, so it serves as its own bounds. However
> plain accesses can execute anywhere within a more extended region.
>
> If Y is plain, Y' is preserved, and Y' ->ppo Y then we know
> that Y' must execute at least once before any execution of Y.
> Thus Y' occurs at or before the start of the bounded region
> for Y.
>
> If X is plain, X' is preserved, and X' overwrites X or there
> is a memory barrier between X and X' then we know X cannot
> execute after any instance of X' has executed. Thus X' occurs
> at or after the end of the bounded region for X. (Exercise:
> Show that the case where X' is plain and overwrites X with no
> barriers in between doesn't lead to any problems.)
>
> If a potential race exists which does not meet the ordering
> requirement, the execution is flagged as containing a data race. This
> definition is less stringent that the one used in the C standard, but
> I think it should be suitable for the kernel.

[A resend, +LKML]

Unless I'm mis-reading/-applying this definition, this will flag the
following test (a variation on your "race.litmus") with "data-race":

C no-race

{}

P0(int *x, spinlock_t *s)
{
spin_lock(s);
WRITE_ONCE(*x, 1); /* A */
spin_unlock(s); /* B */
}

P1(int *x, spinlock_t *s)
{
int r1;

spin_lock(s); /* C */
r1 = *x; /* D */
spin_unlock(s);
}

exists (1:r1=1)

Broadly speaking, this is due to the fact that the modified "happens-
before" axiom does not forbid the execution with the (MP-) cycle

A ->po-rel B ->rfe C ->acq-po D ->fre A

and then to the link "D ->race-from-r A" here defined.

(In part., similar considerations hold for the following litmus test:

C MP1

{}

P0(int *x, int *y)
{
*x = 1;
smp_store_release(y, 1);
}

P1(int *x, int *y)
{
int r0;
int r1 = -1;

r0 = smp_load_acquire(y);
if (r0)
r1 = *x;
}

exists (1:r0=1 /\ 1:r1=0)

)

I wonder whether you actually intended to introduce these "races"...?

Andrea





>
> Here's a simple example illustrating the data race detection:
>
> $ cat race.litmus
> C race
>
> {}
>
> P0(int *x)
> {
> WRITE_ONCE(*x, 1);
> }
>
> P1(int *x)
> {
> int r1;
>
> r1 = *x;
> }
>
> exists (1:r1=1)
>
>
> $ herd7 -conf linux-kernel.cfg race.litmus
> Test race Allowed
> States 2
> 1:r1=0;
> 1:r1=1;
> Ok
> Witnesses
> Positive: 1 Negative: 1
> Flag data-race
> Condition exists (1:r1=1)
> Observation race Sometimes 1 1
> Time race 0.00
> Hash=5ad41863408315a556a8d38691c4924d
>
>
> Alan
>
>
>
> Index: usb-4.x/tools/memory-model/linux-kernel.bell
> ===================================================================
> --- usb-4.x.orig/tools/memory-model/linux-kernel.bell
> +++ usb-4.x/tools/memory-model/linux-kernel.bell
> @@ -24,6 +24,7 @@ instructions RMW[{'once,'acquire,'releas
> enum Barriers = 'wmb (*smp_wmb*) ||
> 'rmb (*smp_rmb*) ||
> 'mb (*smp_mb*) ||
> + 'barrier (*barrier*) ||
> 'rcu-lock (*rcu_read_lock*) ||
> 'rcu-unlock (*rcu_read_unlock*) ||
> 'sync-rcu (*synchronize_rcu*) ||
> @@ -73,3 +74,8 @@ flag ~empty Srcu-unlock \ range(srcu-rsc
>
> (* Check for use of synchronize_srcu() inside an RCU critical section *)
> flag ~empty rcu-rscs & (po ; [Sync-srcu] ; po) as invalid-sleep
> +
> +(* Compute marked and plain memory accesses *)
> +let marked = IW | Once | Release | Acquire | domain(rmw) | range(rmw) |
> + LKR | LKW | UL | LF | RL | RU
> +let plain = M \ marked
> Index: usb-4.x/tools/memory-model/linux-kernel.cat
> ===================================================================
> --- usb-4.x.orig/tools/memory-model/linux-kernel.cat
> +++ usb-4.x/tools/memory-model/linux-kernel.cat
> @@ -37,6 +37,10 @@ let gp = po ; [Sync-rcu | Sync-srcu] ; p
>
> let strong-fence = mb | gp
>
> +(* Memory barriers are also compiler barriers *)
> +let barrier = fencerel(Barrier | Wmb | Rmb | Mb | Acquire | Release |
> + Sync-rcu | Sync-srcu)
> +
> (* Release Acquire *)
> let acq-po = [Acquire] ; po ; [M]
> let po-rel = [M] ; po ; [Release]
> @@ -58,24 +62,48 @@ empty rmw & (fre ; coe) as atomic
> (**********************************)
>
> (* Preserved Program Order *)
> -let dep = addr | data
> -let rwdep = (dep | ctrl) ; [W]
> +let data-rfi-star = (data ; rfi)*
> +let rwdep = data-rfi-star ; (addr | ((data | ctrl) ; [marked])) ; [W]
> let overwrite = co | fr
> let to-w = rwdep | (overwrite & int)
> -let to-r = addr | (dep ; rfi)
> -let fence = strong-fence | wmb | po-rel | rmb | acq-po
> -let ppo = to-r | to-w | fence | (po-unlock-rf-lock-po & int)
> +let to-r = (data-rfi-star ; (addr | data) ; rfi ; [marked]) |
> + ([marked] ; data-rfi-star ; addr ; [R])
> +let fence = strong-fence | wmb | po-rel | rmb | acq-po |
> + (po-unlock-rf-lock-po & int)
> +let ppo = to-r | to-w | fence
> +
> +(*
> + * Preserved accesses.
> + *
> + * All marked writes are preserved.
> + * A plain write need not be preserved if it is overwritten before the
> + * next compiler barrier. But if it is read by a marked read or a
> + * preserved read on another CPU then it is preserved.
> + *
> + * All marked reads are preserved.
> + * A plain read is not preserved unless it is the source of a dependency
> + * to a preserved access. Even then, it isn't preserved if it reads from
> + * a plain write on the same CPU with no compiler barrier in between.
> + *)
> +let non-preserved-r = range(([plain] ; rfi) \ barrier)
> +let rec preserved = preserved-w | preserved-r
> + and preserved-w = (W & marked) |
> + (W \ domain(coi \ barrier)) | domain(rf ; [marked]) |
> + domain(rfe ; [preserved-r])
> + and preserved-r = (R & marked) |
> + (domain((to-w | to-r) ; [preserved]) \ non-preserved-r)
>
> (* Propagation: Ordering from release operations and strong fences. *)
> let A-cumul(r) = rfe? ; r
> let cumul-fence = A-cumul(strong-fence | po-rel) | wmb | po-unlock-rf-lock-po
> -let prop = (overwrite & ext)? ; cumul-fence* ; rfe?
> +let prop = [marked] ; ((overwrite & ext)? ; cumul-fence* ; rfe?) ; [marked]
>
> (*
> * Happens Before: Ordering from the passage of time.
> * No fences needed here for prop because relation confined to one process.
> *)
> -let hb = ppo | rfe | ((prop \ id) & int)
> +let hb = ([preserved] ; ppo ; [preserved]) |
> + ([marked] ; rfe ; [marked]) | ((prop \ id) & int)
> acyclic hb as happens-before
>
> (****************************************)
> @@ -83,7 +111,7 @@ acyclic hb as happens-before
> (****************************************)
>
> (* Propagation: Each non-rf link needs a strong fence. *)
> -let pb = prop ; strong-fence ; hb*
> +let pb = prop ; strong-fence ; hb* ; [marked]
> acyclic pb as propagation
>
> (*******)
> @@ -131,7 +159,7 @@ let rec rcu-fence = rcu-gp | srcu-gp |
> (rcu-fence ; rcu-link ; rcu-fence)
>
> (* rb orders instructions just as pb does *)
> -let rb = prop ; po ; rcu-fence ; po? ; hb* ; pb*
> +let rb = prop ; po ; rcu-fence ; po? ; hb* ; pb* ; [marked]
>
> irreflexive rb as rcu
>
> @@ -143,3 +171,27 @@ irreflexive rb as rcu
> * let xb = hb | pb | rb
> * acyclic xb as executes-before
> *)
> +
> +
> +(*********************************)
> +(* Plain accesses and data races *)
> +(*********************************)
> +
> +(* Boundaries for lifetimes of plain accesses *)
> +let bound-before = [marked] | ([preserved] ; ppo)
> +let bound-after = [marked] | ((fri | coi | fence) ; [preserved])
> +
> +(* Visibility of a write *)
> +let xb = hb | pb | rb
> +let full-fence = strong-fence | (po ; rcu-fence ; po?)
> +let vis = cumul-fence* ; rfe? ; ((full-fence ; xb* ) | (xb* & int))
> +
> +(* Potential races *)
> +let pre-race = ext & ((plain * M) | ((M \ IW) * plain))
> +
> +(* Actual races *)
> +let coe-next = (coe \ (co ; co)) | rfe
> +let race-from-w = (pre-race & coe-next) \ (bound-after ; vis ; bound-before)
> +let race-from-r = (pre-race & fre) \ (bound-after ; xb+ ; bound-before)
> +
> +flag ~empty race-from-w | race-from-r as data-race
> Index: usb-4.x/tools/memory-model/linux-kernel.def
> ===================================================================
> --- usb-4.x.orig/tools/memory-model/linux-kernel.def
> +++ usb-4.x/tools/memory-model/linux-kernel.def
> @@ -24,6 +24,7 @@ smp_mb__before_atomic() { __fence{before
> smp_mb__after_atomic() { __fence{after-atomic}; }
> smp_mb__after_spinlock() { __fence{after-spinlock}; }
> smp_mb__after_unlock_lock() { __fence{after-unlock-lock}; }
> +barrier() { __fence{barrier}; }
>
> // Exchange
> xchg(X,V) __xchg{mb}(X,V)
>