Re: [PATCH] locking/qspinlock: Optimize pending state waiting for unlock

From: Waiman Long
Date: Sat Dec 24 2022 - 21:13:00 EST


On 12/24/22 07:05, guoren@xxxxxxxxxx wrote:
From: Guo Ren <guoren@xxxxxxxxxxxxxxxxx>

When we're pending, we only care about lock value. The xchg_tail
wouldn't affect the pending state. That means the hardware thread
could stay in a sleep state and leaves the rest execution units'
resources of pipeline to other hardware threads. This optimization
may work only for SMT scenarios because the granularity between
cores is cache-block.

Signed-off-by: Guo Ren <guoren@xxxxxxxxxxxxxxxxx>
Signed-off-by: Guo Ren <guoren@xxxxxxxxxx>
Cc: Waiman Long <longman@xxxxxxxxxx>
Cc: Peter Zijlstra <peterz@xxxxxxxxxxxxx>
Cc: Boqun Feng <boqun.feng@xxxxxxxxx>
Cc: Will Deacon <will@xxxxxxxxxx>
Cc: Ingo Molnar <mingo@xxxxxxxxxx>
---
kernel/locking/qspinlock.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
index 2b23378775fe..ebe6b8ec7cb3 100644
--- a/kernel/locking/qspinlock.c
+++ b/kernel/locking/qspinlock.c
@@ -371,7 +371,7 @@ void __lockfunc queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
/*
* We're pending, wait for the owner to go away.
*
- * 0,1,1 -> 0,1,0
+ * 0,1,1 -> *,1,0
*
* this wait loop must be a load-acquire such that we match the
* store-release that clears the locked bit and create lock
Yes, we don't care about the tail.
@@ -380,7 +380,7 @@ void __lockfunc queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
* barriers.
*/
if (val & _Q_LOCKED_MASK)
- atomic_cond_read_acquire(&lock->val, !(VAL & _Q_LOCKED_MASK));
+ smp_cond_load_acquire(&lock->locked, !VAL);
/*
* take ownership and clear the pending bit.

We may save an AND operation here which may be a cycle or two.  I remember that it may be more costly to load a byte instead of an integer in some arches. So it doesn't seem like that much of an optimization from my point of view. I know that arm64 will enter a low power state in this *cond_load_acquire() loop, but I believe any change in the state of the the lock cacheline will wake it up. So it doesn't really matter if you are checking a byte or an int.

Do you have any other data point to support your optimization claim?

Cheers,
Longman