Re: [RFC] Potential problem in qspinlock due to mixed-size accesses
From: Andrea Parri
Date: Fri Jun 13 2025 - 07:17:56 EST
> (snip the excellent details)

Indeed, joining in praising this report - Great work, Thomas!

> > ### Solutions
> >
> > The problematic executions rely on the fact that T2 can move half of its
> > load operation (1) to before the xchg_tail (3).
> > Preventing this reordering solves all issues. Possible solutions are:
> > - make the xchg_tail full-sized (i.e, also touch lock/pending bits).
> > Note that if the kernel is configured with >= 16k cpus, then the tail
> > becomes larger than 16 bits and needs to be encoded in parts of the pending
> > byte as well.
> > In this case, the kernel makes a full-sized (32-bit) access for the
> > xchg. So the above bugs are only present in the < 16k cpus setting.
>
> Right, but that is the more expensive option for some.
>
> > - make the xchg_tail an acquire operation.
> > - make the xchg_tail a release operation (this is an odd solution by
> > itself but works for aarch64 because it preserves REL->ACQ ordering). In
> > this case, maybe the preceding "smp_wmb()" can be removed.
>
> I think I prefer this one, it move a barrier, not really adding
> additional overhead. Will?
>
> > - put some other read-read barrier between the xchg_tail and the load.
> >
> >
> > ### Implications for qspinlock executed on non-ARM architectures.
> >
> > Unfortunately, there are no MSA extensions for other hardware memory models,
> > so we have to speculate based on whether the problematic reordering is
> > permitted if the problematic load was treated as two individual
> > instructions.
> > It seems Power and RISCV would have no problem reordering the instructions,
> > so qspinlock might also break on those architectures.
>
> Power (and RiscV without ZABHA) 'emulate' the short XCHG using a full
> word LL/SC and should be good.
>
> But yes, ZABHA might be equally broken.

RISC-V forbids store-forwarding from AMOs or SCs; certain (non-normative)
commentary in the spec clarifies that the same ordering rule applies when
the memory accesses in question overlap only partially.

I am not aware of any "RISC-V implementation" manifesting the load-load
reordering in question. In any case, notice that making xchg_tail() a
release operation might not suffice to fix such an implementation, given
that the arch has no plain load-acquire instruction yet and relies on the
generic (fence-based) code for atomic_cond_read_acquire().
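To make the last point concrete, here is a minimal userspace sketch using
C11 atomics (not kernel code) that contrasts the two shapes of a conditional
acquire read. The function names and LOCKED_PENDING_MASK are hypothetical
stand-ins for illustration; the first function only approximates the
structure of the generic fence-based path, and the second is what a native
load-acquire would permit.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Assumed layout: locked byte + pending byte of the 32-bit lock word. */
#define LOCKED_PENDING_MASK 0x0000ffffu

/*
 * Fence-based shape: spin with relaxed loads, then upgrade to acquire with
 * a fence once the condition holds.  The load itself carries no ordering
 * annotation; only the fence placed after it does.
 */
static uint32_t cond_read_acquire_fence(_Atomic uint32_t *word, uint32_t mask)
{
	uint32_t val;

	assert(mask != 0);
	while ((val = atomic_load_explicit(word, memory_order_relaxed)) & mask)
		;	/* the kernel would cpu_relax() here */
	atomic_thread_fence(memory_order_acquire);
	return val;
}

/* Load-acquire shape: the ordering is attached to the load itself. */
static uint32_t cond_read_acquire_load(_Atomic uint32_t *word, uint32_t mask)
{
	uint32_t val;

	assert(mask != 0);
	while ((val = atomic_load_explicit(word, memory_order_acquire)) & mask)
		;
	return val;
}
```

In the first shape there is no acquire load for a preceding release
operation to be ordered against, which is why a REL->ACQ style argument
does not obviously carry over.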
Andrea