[RFC] Potential problem in qspinlock due to mixed-size accesses
From: Thomas Haas
Date: Thu Jun 12 2025 - 11:08:45 EST
We have been investigating whether mixed-size accesses (MSA) can affect the
correctness of qspinlock.
We focus on aarch64, which is the only memory model with MSA support [1].
For this, we extended the dartagnan [2] tool to support MSA, and it now
reports liveness, synchronization, and mutual exclusion issues.
Note that we did something similar in the past for LKMM, but there we
ignored MSA [3].
The culprit of all these issues is that the atomicity of a single load
instruction is not guaranteed in the presence of smaller-sized stores
(observed on real hardware according to [1], Fig. 21/22).
Consider the following pseudo code:
int16 old = xchg16_rlx(&lock, 42);
int32 l = load32_acq(&lock);
Then the hardware can treat the code as (likely due to store-forwarding)
int16 old = xchg16_rlx(&lock, 42);
int16 l1 = load16_acq(&lock);
int16 l2 = load16_acq(&lock + 2); // Assuming byte-precise pointer arithmetic
and reorder it to
int16 l2 = load16_acq(&lock + 2);
int16 old = xchg16_rlx(&lock, 42);
int16 l1 = load16_acq(&lock);
Now another thread can overwrite "lock" between the first two accesses, so
that the original l (recombined from l1 and l2) ends up containing parts of
a lock value that is older than the one the xchg observed.
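To make the access pattern concrete, here is a minimal C11-atomics sketch of
the split load racing against a concurrent writer. The struct layout,
variable names, and thread bodies are our own illustration (C11 atomics do
not model the arm64 MSA reordering); the sketch only shows which
interleaving becomes possible once the 32-bit load is split into halves.

    #include <stdatomic.h>
    #include <stdint.h>

    struct split_lock {
            _Atomic uint16_t low;   /* the half targeted by xchg16_rlx   */
            _Atomic uint16_t high;  /* the other half of the 32-bit word */
    };

    static struct split_lock lck;

    /* T2's accesses, in the reordered form shown above. */
    static uint32_t t2(void)
    {
            uint16_t l2  = atomic_load_explicit(&lck.high, memory_order_acquire);
            uint16_t old = atomic_exchange_explicit(&lck.low, 42,
                                                    memory_order_relaxed);
            uint16_t l1  = atomic_load_explicit(&lck.low, memory_order_acquire);
            (void)old;
            /* l mixes a stale high half (read before the xchg) with a
               fresh low half (read after it). */
            return ((uint32_t)l2 << 16) | l1;
    }

    /* Another thread overwriting both halves between T2's first two accesses. */
    static void other(void)
    {
            atomic_store_explicit(&lck.high, 7, memory_order_relaxed);
            atomic_store_explicit(&lck.low,  7, memory_order_relaxed);
    }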
Let us now explain how this can break qspinlock.
We need 3 threads taking the slowpath: T1, T2, and T3.
T1 is a setup thread that takes the lock just so that T2 and T3 observe
contention and try to enqueue themselves.
So consider a situation where
- T3's node is the only enqueued node (due to previous contention with T1),
- T2 is about to enqueue its node,
- T3 is about to take the lock because it sees no contention (the lock is
  free, no pending bits are set, and it has the only node in the queue), and
- the lock is released (T1 is already done).
We focus on the following lines of code in qspinlock.c:
        // <- T2 is here, but executes part of line 328 first.
277:    old = xchg_tail(lock, tail); // ~ xchg_rlx(&lock->tail, tail)
        // ...
328:    val = atomic_cond_read_acquire(&lock->val, !(VAL & _Q_LOCKED_PENDING_MASK));
        // ...
        // <- T3 is here
352:    if ((val & _Q_TAIL_MASK) == tail) {
            if (atomic_try_cmpxchg_relaxed(&lock->val, &val, _Q_LOCKED_VAL))
                goto release; /* No contention */
355:    }
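The executions below exploit the split of the 32-bit lock word into a
locked_pending half and a tail half; for reference, a sketch of the
little-endian layout (following qspinlock_types.h):

    typedef struct qspinlock {
            union {
                    atomic_t val;                   // the full 32-bit lock word
                    struct {
                            u8      locked;
                            u8      pending;
                    };
                    struct {
                            u16     locked_pending; // the half T2 reads in (1)
                            u16     tail;           // the half xchg_tail swaps
                    };
            };
    } arch_spinlock_t;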
Then the following sequence of operations is problematic:
(1) T2 reads half of the lock (the half that is not lock->tail, i.e., it
"only" reads lock->locked_pending)
328:    val = atomic_cond_read_acquire(&lock->val, !(VAL & _Q_LOCKED_PENDING_MASK)); // P1
        // Roughly: val_locked_pending = load_acquire(&lock->locked_pending);
T2 observes a free lock and no pending bits set.
(2) T3 takes the lock because it does not observe contention
352:    if ((val & _Q_TAIL_MASK) == tail) { // Satisfied because only T3's node is enqueued and the lock is free
            if (atomic_try_cmpxchg_relaxed(&lock->val, &val, _Q_LOCKED_VAL))
                goto release; /* No contention */
355:    }
T3 clears the tail and claims the lock. The queue is empty now.
(3) T2 enqueues itself
277: old = xchg_tail(lock, tail);
T2 sees an empty queue (old == 0) because T3 just cleared it, and
enqueues its node.
(4) T2 reads the remaining half of the lock from (1), reading the tail
(lock->tail) it just inserted.
328:    val = atomic_cond_read_acquire(&lock->val, !(VAL & _Q_LOCKED_PENDING_MASK)); // P2
        // Roughly: val_tail = load_acquire(&lock->tail);
T2 observes its own tail, plus a free lock and no pending bits set
(from (1)).
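Putting (1)-(4) together, T2's effective access order in this execution is
(a sketch in the pseudo-code notation from above; the variable names are
ours):

    val_lp   = load16_acq(&lock->locked_pending); // (1) lock free, no pending bits
                                                  // (2) T3 takes the lock and clears the tail
    old      = xchg16_rlx(&lock->tail, tail);     // (3) old == 0, so T2 enqueues its node
    val_tail = load16_acq(&lock->tail);           // (4) T2 reads back its own tail
    val      = (val_tail << 16) | val_lp;         //     => val == (T2's tail | free, unpended lock)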
Now there are two continuations, one leading to broken synchronisation,
another leading to non-termination or failure of mutual exclusion.
We first consider the synchronisation issue.
(5a) T3 finishes its critical section and releases the lock
(6a) T2 takes the lock with the same code as in point (2):
352:    if ((val & _Q_TAIL_MASK) == tail) { // Satisfied because only T2's node is enqueued and the lock is free
            if (atomic_try_cmpxchg_relaxed(&lock->val, &val, _Q_LOCKED_VAL))
                goto release; /* No contention */
355:    }
Notice that the "atomic_try_cmpxchg_relaxed" would fail if the
lock was still taken by T3, because "val" has no lock bits set (as T2
observed the lock to be free).
Although T2 now properly enters the CS after T3 has finished, the
synchronisation is broken since T2 did not perform an acquire operation
to synchronise with the lock release.
Indeed, dartagnan reports no safety violations (with 3 threads)
if the CAS is made an acquire or the CS itself contains an acquire
barrier (smp_rmb() or smp_mb()).
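For illustration, the acquire-CAS variant would look like the following
sketch (not a proposed patch; the only change is relaxed -> acquire on the
trylock CAS):

352:    if ((val & _Q_TAIL_MASK) == tail) {
            if (atomic_try_cmpxchg_acquire(&lock->val, &val, _Q_LOCKED_VAL))
                goto release; /* No contention */
355:    }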
Now, let's consider the alternative continuation that leads to
non-termination or failure of mutual exclusion.
(5b) T2 tries to take the lock as above in (6a) but the CAS fails
because the lock is still taken by T3.
(6b) Due to the failing CAS, T2 observes contention (or at least it thinks
so). It sets the locked byte (although T3 might still hold the lock!) and
waits until the (non-existent) "contending thread" enqueues its node:
        /*
         * Either somebody is queued behind us or _Q_PENDING_VAL got set
         * which will then detect the remaining tail and queue behind us
         * ensuring we'll see a @next.
         */
362:    set_locked(lock);

        /*
         * contended path; wait for next if not observed yet, release.
         */
367:    if (!next)
368:            next = smp_cond_load_relaxed(&node->next, (VAL));
The first comment suggests that the assumed situation is that either
another thread has enqueued a node or the pending bits got set. But neither
is true in the current execution: we got here only because the lock was
still taken.
Now there are two outcomes:
- Non-termination: the "smp_cond_load_relaxed" waits forever,
because there is no other thread that enqueues another node.
- Broken mutual exclusion: another thread (T4) enqueues a node and thereby
releases T2 from its loop. Now T2 enters the CS while T3 is still executing
its own CS.
Indeed, dartagnan reports this safety violation only with 4+
threads.
NOTE: For the above examples we forced all threads to take the slowpath
of qspinlock by removing the fastpath. With the fastpath present,
another thread (T0) is needed to force the other threads into the
slowpath, i.e., we need 4-5 threads to witness the issues.
### Solutions
The problematic executions rely on the fact that T2 can move half of its
load operation (1) to before the xchg_tail (3).
Preventing this reordering solves all issues. Possible solutions are:
- make the xchg_tail full-sized (i.e., also touch the locked/pending bits).
  Note that if the kernel is configured for >= 16k CPUs, the tail becomes
  larger than 16 bits and needs to be encoded in parts of the pending byte
  as well. In this case, the kernel already makes a full-sized (32-bit)
  access for the xchg, so the above bugs are only present in the < 16k CPUs
  setting.
- make the xchg_tail an acquire operation (see the sketch after this list).
- make the xchg_tail a release operation (an odd solution by itself, but it
  works for aarch64 because REL->ACQ ordering is preserved). In this case,
  maybe the preceding "smp_wmb()" can be removed.
- put some other read-read barrier between the xchg_tail and the load.
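As an illustration of the acquire option, a sketch of xchg_tail for the
_Q_PENDING_BITS == 8 case (untested; the only intended change to the
current code is relaxed -> acquire):

    static __always_inline u32 xchg_tail(struct qspinlock *lock, u32 tail)
    {
            /*
             * The acquire orders the 16-bit exchange before the subsequent
             * atomic_cond_read_acquire() of lock->val, so neither half of
             * that load can read a value older than what the xchg observed.
             */
            return (u32)xchg_acquire(&lock->tail,
                                     tail >> _Q_TAIL_OFFSET) << _Q_TAIL_OFFSET;
    }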
### Implications for qspinlock executed on non-ARM architectures.
Unfortunately, there are no MSA extensions for other hardware memory
models, so we have to speculate based on whether the problematic reordering
would be permitted if the problematic load were treated as two individual
instructions.
It seems Power and RISC-V would have no problem reordering the instructions,
so qspinlock might also break on those architectures.
TSO, on the other hand, does not permit such reordering. Also, xchg_tail is
an RMW operation, which acts as a full memory barrier under TSO, so even if
load-load reordering were permitted, the RMW would prevent it.
[1] https://dl.acm.org/doi/10.1145/3458926
[2] https://github.com/hernanponcedeleon/Dat3M
[3] https://lkml.org/lkml/2022/8/26/597
--
=====================================
Thomas Haas
Technische Universität Braunschweig
Institut für Theoretische Informatik
Mühlenpfordtstr. 23, Raum IZ 343
38106 Braunschweig | Germany
t.haas@xxxxxxxxxxxxxxxxxx
https://www.tu-braunschweig.de/tcs/team/thomas-haas