On 16.06.25 16:23, Will Deacon wrote:
On Fri, Jun 13, 2025 at 09:55:01AM +0200, Peter Zijlstra wrote:
On Thu, Jun 12, 2025 at 04:55:28PM +0200, Thomas Haas wrote:
We have been looking into whether mixed-size accesses (MSA) can affect the
correctness of qspinlock.
We are focusing on aarch64, which is the only memory model with MSA support
[1].
For this, we extended the dartagnan [2] tool to support MSA, and it now
reports liveness, synchronization, and mutual-exclusion issues.
Note that we did something similar for LKMM in the past, but there we
ignored MSA [3].
The culprit behind all these issues is that the atomicity of single load
instructions is not guaranteed in the presence of smaller-sized stores
(observed on real hardware according to [1], Fig. 21/22).
Consider the following pseudo code:
int16 old = xchg16_rlx(&lock, 42);
int32 l = load32_acq(&lock);
Then the hardware can treat the code as (likely due to store-forwarding)
int16 old = xchg16_rlx(&lock, 42);
int16 l1 = load16_acq(&lock);
int16 l2 = load16_acq(&lock + 2); // assuming byte-precise pointer arithmetic
and reorder it to
int16 l2 = load16_acq(&lock + 2);
int16 old = xchg16_rlx(&lock, 42);
int16 l1 = load16_acq(&lock);
Now another thread can overwrite "lock" in between the first two accesses,
so that the original l (l1 and l2) ends up containing parts of a lock value
that is older than what the xchg observed.
Oops :-(
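To make the pattern concrete, here is a standalone C11 sketch of the same
access shape. The union layout and the names (lockword, t1, t2) are purely
illustrative and not the kernel code, and mixed-size atomics like this sit
outside what the C standard itself guarantees; the question is purely what
the hardware may do:

#include <stdatomic.h>
#include <stdint.h>

/* Illustrative only: a 32-bit lock word that is also accessed in 16-bit halves. */
union lockword {
    _Atomic uint32_t full;        /* 32-bit view of the whole lock word */
    struct {
        _Atomic uint16_t low;     /* 16-bit view at offset 0 */
        _Atomic uint16_t high;    /* 16-bit view at offset 2 */
    } halves;
};

static union lockword lock;

void t1(void)
{
    /* The "other thread" overwriting the whole lock word in between. */
    atomic_store_explicit(&lock.full, 0x00070007u, memory_order_relaxed);
}

void t2(void)
{
    /* 16-bit relaxed exchange on the low half (xchg16_rlx above) ... */
    uint16_t old = atomic_exchange_explicit(&lock.halves.low, 42,
                                            memory_order_relaxed);
    /* ... followed by a 32-bit acquire load of the whole word (load32_acq).
     * The issue is whether this load is single-copy atomic w.r.t. the
     * smaller store, or may effectively be satisfied half-by-half. */
    uint32_t l = atomic_load_explicit(&lock.full, memory_order_acquire);
    (void)old;
    (void)l;
}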
(snip the excellent details)
### Solutions
The problematic executions rely on the fact that T2 can move half of its
load operation (1) to before the xchg_tail (3).
Preventing this reordering solves all issues. Possible solutions are:
- make the xchg_tail full-sized (i.e., also touch the lock/pending bits).
Note that if the kernel is configured with >= 16k CPUs, then the tail
becomes larger than 16 bits and needs to be encoded in parts of the pending
byte as well.
In this case, the kernel already makes a full-sized (32-bit) access for the
xchg, so the above bugs are only present in the < 16k CPUs setting.
Right, but that is the more expensive option for some.
- make the xchg_tail an acquire operation.
- make the xchg_tail a release operation (this is an odd solution by
itself, but it works for aarch64 because it preserves REL->ACQ ordering;
see the sketch below). In this case, maybe the preceding "smp_wmb()" can be
removed.
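As a sketch of that release variant (approximating the shape of the
sub-word xchg_tail() for the < 16k CPUs configuration in
kernel/locking/qspinlock.c; treat the exact field and macro names as
illustrative):

static __always_inline u32 xchg_tail(struct qspinlock *lock, u32 tail)
{
	/*
	 * RELEASE orders the 16-bit tail store against the later acquire
	 * load of the full lock word; on aarch64 the REL->ACQ ordering
	 * should then rule out half of that load being satisfied early.
	 * If the release also covers the node initialisation, the
	 * smp_wmb() before the call might become redundant.
	 */
	return (u32)xchg_release(&lock->tail,
				 tail >> _Q_TAIL_OFFSET) << _Q_TAIL_OFFSET;
}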
I think I prefer this one; it moves a barrier, not really adding
additional overhead. Will?
I'm half inclined to think that the Arm memory model should be tightened
here; I can raise that with Arm and see what they say.
Although the cited paper does give examples of store-forwarding from a
narrow store to a wider load, the case in qspinlock is further
constrained by having the store come from an atomic rmw and the load
having acquire semantics. Setting aside the MSA part, that specific case
_is_ ordered in the Arm memory model (and C++ release sequences rely on
it iirc), so it's fair to say that Arm CPUs don't permit forwarding from
an atomic rmw to an acquire load.
Given that, I don't see how this is going to occur in practice.
Will
You are probably right. The ARM model's atomic-ordered-before relation
let aob = rmw | [range(rmw)]; lrs; [A | Q]
clearly orders the rmw-store with subsequent acquire loads (lrs = local-read-successor, A = acquire).
If we treat this relation (at least the second part) as a "global ordering" and extend it by "si" (same-instruction), then the problematic reordering under MSA should be gone.
I quickly ran Dartagnan on the MSA litmus tests with this change to the ARM model and all the tests still pass.
We should definitely ask ARM about this. I did send an email to Jade before writing about this issue here, but she was (and still is) busy and told me to ask at memory-model@xxxxxxx.
I will ask them.