Re: [RFC] Potential problem in qspinlock due to mixed-size accesses

From: Thomas Haas
Date: Thu Jun 19 2025 - 10:12:05 EST




On 19.06.25 14:30, Will Deacon wrote:
On Tue, Jun 17, 2025 at 09:00:30PM +0200, Hernan Ponce de Leon wrote:
On 6/17/2025 4:23 PM, Thomas Haas wrote:


On 17.06.25 16:17, Will Deacon wrote:
On Tue, Jun 17, 2025 at 10:42:04AM +0200, Hernan Ponce de Leon wrote:
On 6/17/2025 8:19 AM, Thomas Haas wrote:
On 16.06.25 16:23, Will Deacon wrote:
I'm half inclined to think that the Arm memory model
should be tightened
here; I can raise that with Arm and see what they say.

Although the cited paper does give examples of store-forwarding from a
narrow store to a wider load, the case in qspinlock is further
constrained by having the store come from an atomic rmw and the load
having acquire semantics. Setting aside the MSA part,
that specific case
_is_ ordered in the Arm memory model (and C++ release
sequences rely on
it iirc), so it's fair to say that Arm CPUs don't permit
forwarding from
an atomic rmw to an acquire load.

Given that, I don't see how this is going to occur in practice.

You are probably right. The ARM model's atomic-ordered-before relation

       let aob = rmw | [range(rmw)]; lrs; [A | Q]

clearly orders the rmw-store with subsequent acquire loads
(lrs = local-
read-successor, A = acquire).
If we treat this relation (at least the second part) as a "global
ordering" and extend it by "si" (same-instruction), then the
problematic
reordering under MSA should be gone.
I quickly ran Dartagnan on the MSA litmus tests with this change to the
ARM model and all the tests still pass.

Even with this change I still get violations (both safety and
termination)
for qspinlock with dartagnan.

Please can you be more specific about the problems you see?

I talked to Hernán personally and it turned out that he used the generic
implementation of smp_cond_acquire (not sure if the name is correct)
which uses a relaxed load followed by a barrier. In that case, replacing
aob by aob;si does not change anything.
Indeed, even in the reported problem we used the generic implementation
(I was unaware of this), though it is easy to check that changing the
relaxed load to acquire does not give sufficient orderings.

Yes, my bad. I was using the generic header rather than the aarch64 specific
one and then the changes to the model were having not effect (as they
should).

Now I am using the aarch64 specific ones and I can confirm dartagnan still
reports the violations with the current model and making the change proposed
by Thomas (adding ;si just to the second part seems to be enough) indeed
removes all violations.

That's great! Thanks for working together to get to the bottom of it. I
was worried that this was the tip of the iceberg.

I'll try to follow-up with Arm to see if that ';si' addition is an
acceptable to atomic-ordered before. If not, we'll absorb the smp_wmb()
into xchg_release() with a fat comment about RCsc and mixed-size
accesses.

Will

I have already contacted ARM, but I think it will take some time to get an answer.
Interestingly, the definition of aob has changed in a newer version of the ARM cat model:

let aob = [Exp & M]; rmw; [Exp & M]
| [Exp & M]; rmw; lrs; [A | Q]
| [Imp & TTD & R]; rmw; [HU]

Ignoring all the extra Exp/Imp/TTD stuff, you can see that the second part of the definition only orders the load-part of the rmw with the following acquire load. This suggests that a form of store forwarding might indeed be expected, seeing that they deliberately removed the rmw_store-acq_load ordering they had previously.
Nevertheless, for qspinlock to work we just need the rmw_load-acq_load ordering that is still provided as long as we extend it with ";si".

--
=====================================

Thomas Haas

Technische Universität Braunschweig
Institut für Theoretische Informatik
Mühlenpfordtstr. 23, Raum IZ 343
38106 Braunschweig | Germany

t.haas@xxxxxxxxxxxxxxxxxx
https://www.tu-braunschweig.de/tcs/team/thomas-haas