[RFC] Potential problem in qspinlock due to mixed-size accesses

From: Thomas Haas
Date: Thu Jun 12 2025 - 11:08:45 EST


We have been looking into whether mixed-size accesses (MSA) can affect the correctness of qspinlock.
We focus on aarch64, which has the only formalised hardware memory model with MSA support [1].
For this, we extended the dartagnan [2] tool to support MSA, and it now reports liveness, synchronization, and mutex issues.
Notice that we did something similar in the past for LKMM, but there we ignored MSA [3].

The culprit behind all these issues is that the atomicity of a single load instruction is not guaranteed in the presence of smaller-sized stores (observed on real hardware according to [1], Figs. 21/22).
Consider the following pseudo code:

    int16 old = xchg16_rlx(&lock, 42);
    int32 l = load32_acq(&lock);

Then the hardware can treat the code as (likely due to store-forwarding)

    int16 old = xchg16_rlx(&lock, 42);
    int16 l1 = load16_acq(&lock);
    int16 l2 = load16_acq(&lock + 2); // Assuming byte-precise pointer arithmetic

and reorder it to

    int16 l2 = load16_acq(&lock + 2);
    int16 old = xchg16_rlx(&lock, 42);
    int16 l1 = load16_acq(&lock);

Now another thread can overwrite "lock" between the first two accesses, so that the original l (combined from l1 and l2) ends up containing
parts of a lock value that is older than the one the xchg observed.
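
To make the hazard easy to experiment with, the same pattern can be written as a standalone C sketch (our own code using GNU C builtins, little-endian; not kernel code):

    #include <stdint.h>

    union word {
        uint32_t whole;
        uint16_t half[2];   /* little-endian: half[0] = low 16 bits */
    };

    union word lock;

    void reader(void)
    {
        uint16_t old, l1, l2;

        /* 16-bit relaxed exchange on the low half of the word. */
        old = __atomic_exchange_n(&lock.half[0], 42, __ATOMIC_RELAXED);

        /*
         * The intended single 32-bit acquire load
         *     uint32_t l = __atomic_load_n(&lock.whole, __ATOMIC_ACQUIRE);
         * may be executed by the hardware as the two 16-bit acquire loads
         * below, and the second (high-half) load may then be satisfied
         * before the exchange.
         */
        l1 = __atomic_load_n(&lock.half[0], __ATOMIC_ACQUIRE);
        l2 = __atomic_load_n(&lock.half[1], __ATOMIC_ACQUIRE);

        (void)old; (void)l1; (void)l2;
    }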

Let us now explain how this can break qspinlock.
We need 3 threads taking the slowpath: T1, T2, and T3.
T1 is a setup-thread that takes the lock just so that T2 and T3 observe contention and try to enqueue themselves.
So consider a situation where
    - T3's node is the only enqueued node (due to previous contention with T1).
    - T2 is about to enqueue its node.
    - T3 is about to take the lock because it sees no contention (lock is free, no pending bits are set, it has the only node in the queue).
    - the lock is released (T1 is already done)

We focus on the following lines of code in qspinlock.c:

        // <- T2 is here, but executes part of line 328 first.
        277: old = xchg_tail(lock, tail);    // ~ xchg_rlx(&lock->tail, tail)
        // ...
        328: val = atomic_cond_read_acquire(&lock->val, !(VAL & _Q_LOCKED_PENDING_MASK));
        // ...
        // <- T3 is here
        352: if ((val & _Q_TAIL_MASK) == tail) {
                    if (atomic_try_cmpxchg_relaxed(&lock->val, &val, _Q_LOCKED_VAL))
                        goto release; /* No contention */
        355: }
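
For reference, the field names used below come from the definition of struct qspinlock (include/asm-generic/qspinlock_types.h, little-endian layout, < 16k CPUs), and xchg_tail in this configuration is a 16-bit relaxed exchange on lock->tail only (both quoted approximately):

        typedef struct qspinlock {
                union {
                        atomic_t val;               /* full 32-bit lock word */
                        struct {
                                u8 locked;          /* bits 0-7 */
                                u8 pending;         /* bits 8-15 */
                        };
                        struct {
                                u16 locked_pending; /* bits 0-15 */
                                u16 tail;           /* bits 16-31 */
                        };
                };
        } arch_spinlock_t;

        static __always_inline u32 xchg_tail(struct qspinlock *lock, u32 tail)
        {
                /* 16-bit relaxed exchange; the locked/pending bytes are not part of the access */
                return (u32)xchg_relaxed(&lock->tail,
                                         tail >> _Q_TAIL_OFFSET) << _Q_TAIL_OFFSET;
        }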


Then the following sequence of operations is problematic:

(1) T2 reads half of the lock (the half that is not lock->tail, i.e., it "only" reads lock->locked_pending)

        328: val = atomic_cond_read_acquire(&lock->val, !(VAL & _Q_LOCKED_PENDING_MASK)); // P1
        // Roughly: val_locked_pending = load_acquire(&lock->locked_pending);

     T2 observes a free lock and no pending bits set.

(2) T3 takes the lock because it does not observe contention

        352: if ((val & _Q_TAIL_MASK) == tail) {  // Satisfied because only T3's node is enqueued and lock is free
                    if (atomic_try_cmpxchg_relaxed(&lock->val, &val, _Q_LOCKED_VAL))
                        goto release; /* No contention */
        355: }

    T3 clears the tail and claims the lock. The queue is empty now.

(3) T2 enqueues itself

        277: old = xchg_tail(lock, tail);

    T2 sees an empty queue (old == 0) because T3 just cleared it, and enqueues its node.

(4) T2 reads the remaining half of the lock from (1), reading the tail (lock->tail) it just inserted.

        328: val = atomic_cond_read_acquire(&lock->val, !(VAL & _Q_LOCKED_PENDING_MASK)); // P2
        // Roughly: val_tail = load_acquire(&lock->tail);

    T2 observes its own tail, plus a free lock and no pending bits set (from (1)).
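
Schematically, steps (1)-(4) correspond to the following split and reordering of T2's load at line 328 (same pseudocode as above):

        // T2, program order:
        277: old = xchg16_rlx(&lock->tail, tail);
        328: val = load32_acq(&lock->val);

        // T2, effective order after the split and reordering:
        328a: lp  = load16_acq(&lock->locked_pending); // (1): lock free, no pending bits
        277:  old = xchg16_rlx(&lock->tail, tail);     // (3): empty queue, cleared by T3 in (2)
        328b: tl  = load16_acq(&lock->tail);           // (4): T2's own freshly inserted tail
        // val combines lp and tl: "lock free, no pending bits, T2's node is the only one queued"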


Now there are two continuations, one leading to broken synchronisation, another leading to non-termination or failure of mutual exclusion.
We first consider the synchronisation issue.


(5a) T3 finishes its critical section and releases the lock


(6a) T2 takes the lock with the same code as in point (2):

         352: if ((val & _Q_TAIL_MASK) == tail) {     // Satisfied because only T2's node is enqueued and lock is free
                    if (atomic_try_cmpxchg_relaxed(&lock->val, &val, _Q_LOCKED_VAL))
                        goto release; /* No contention */
         355: }

      Notice that the "atomic_try_cmpxchg_relaxed" would fail if the lock were still taken by T3, because "val" has no lock bits set (T2 observed the lock to be free).
      Although T2 now properly enters the CS after T3 has finished, synchronisation is broken, since T2 did not perform an acquire operation to synchronise with the lock release.
      Indeed, dartagnan reports no safety violations (with 3 threads) if the CAS is made an acquire or the CS itself contains an acquire barrier (smp_rmb() or smp_mb()).
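
      A minimal sketch of the first of these two fixes, i.e., strengthening the CAS at line 352 to acquire (the alternative would instead put an smp_rmb()/smp_mb() into the CS itself):

         352: if ((val & _Q_TAIL_MASK) == tail) {
                    /* acquire: synchronises with the releasing store of the previous lock holder */
                    if (atomic_try_cmpxchg_acquire(&lock->val, &val, _Q_LOCKED_VAL))
                        goto release; /* No contention */
         355: }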


Now, let's consider the alternative continuation that leads to non-termination or failure of mutual exclusion.


(5b) T2 tries to take the lock as in (6a) above, but the CAS fails because the lock is still held by T3.


(6b) Due to the failing CAS, T2 believes it observes contention. It sets the locked byte (although T3 might still hold the lock!) and waits until the (non-existent) "contending thread" enqueues its node:

        /*
         * Either somebody is queued behind us or _Q_PENDING_VAL got set
         * which will then detect the remaining tail and queue behind us
         * ensuring we'll see a @next.
         */
        362: set_locked(lock);

        /*
         * contended path; wait for next if not observed yet, release.
         */
        367: if (!next)
        368:    next = smp_cond_load_relaxed(&node->next, (VAL));


  The first comment suggests that the assumed situation is that either another thread enqueued a node or the pending bits got set. But neither is
  true in the current execution: we got here because the lock was taken.
  Now there are two outcomes:
    - Non-termination: the "smp_cond_load_relaxed" waits forever, because there is no other thread that enqueues another node.
    - Broken mutex: another thread (T4) enqueues a node and therefore releases T2 from its loop. Now T2 enters the CS while T3 still executes its CS.
       Indeed, dartagnan reports this safety violation only with 4+ threads.


NOTE: For the above examples we forced all threads to take the slowpath of qspinlock by removing the fastpath. With the fastpath present, another thread (T0) is needed to force the other threads into the slowpath, i.e., we need 4-5 threads to witness the issues.

### Solutions

The problematic executions rely on the fact that T2 can move half of its load operation (1) to before the xchg_tail (3).
Preventing this reordering solves all issues. Possible solutions are:
    - make the xchg_tail full-sized (i.e., also touch the locked/pending bits).
      Note that if the kernel is configured for >= 16k CPUs, the tail becomes larger than 16 bits and needs to be encoded in parts of the pending byte as well.
      In that case, the kernel already performs a full-sized (32-bit) access for the xchg, so the above bugs are only present in the < 16k CPUs setting.
    - make the xchg_tail an acquire operation (sketched below).
    - make the xchg_tail a release operation (an odd solution by itself, but it works on aarch64 because aarch64 preserves REL->ACQ ordering). In this case, maybe the preceding "smp_wmb()" can be removed.
    - put some other read-read barrier between the xchg_tail and the load.
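
For instance, the acquire variant would only change the ordering annotation of the 16-bit exchange in qspinlock.c (our sketch, untested):

        static __always_inline u32 xchg_tail(struct qspinlock *lock, u32 tail)
        {
                /*
                 * Acquire ordering prevents the later (possibly split) load of
                 * lock->val at line 328 from being satisfied before the exchange.
                 */
                return (u32)xchg_acquire(&lock->tail,
                                         tail >> _Q_TAIL_OFFSET) << _Q_TAIL_OFFSET;
        }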


### Implications for qspinlock executed on non-ARM architectures

Unfortunately, there are no MSA extensions for other hardware memory models, so we have to speculate based on whether the problematic reordering would be permitted if the problematic load were treated as two individual instructions.
It seems Power and RISC-V would have no problem reordering the instructions, so qspinlock might also break on those architectures.
TSO, on the other hand, does not permit such reordering. Also, xchg_tail is an RMW operation, which acts like a full memory barrier under TSO, so even if load-load reordering were permitted, the RMW would prevent it.


[1] https://dl.acm.org/doi/10.1145/3458926
[2] https://github.com/hernanponcedeleon/Dat3M
[3] https://lkml.org/lkml/2022/8/26/597

--
=====================================

Thomas Haas

Technische Universität Braunschweig
Institut für Theoretische Informatik
Mühlenpfordtstr. 23, Raum IZ 343
38106 Braunschweig | Germany

t.haas@xxxxxxxxxxxxxxxxxx
https://www.tu-braunschweig.de/tcs/team/thomas-haas