Re: [PATCH v6 5/6] locking/rwsem: Enable direct rwsem lock handoff

From: Waiman Long
Date: Tue Jan 24 2023 - 20:54:41 EST


On 1/24/23 07:29, Peter Zijlstra wrote:
On Mon, Jan 23, 2023 at 12:30:59PM -0500, Waiman Long wrote:
On 1/23/23 09:59, Peter Zijlstra wrote:
On Thu, Nov 17, 2022 at 09:20:15PM -0500, Waiman Long wrote:
The lock handoff provided in rwsem isn't a true handoff like that in
the mutex. Instead, it is more like a quiescent state where optimistic
spinning and lock stealing are disabled to make it easier for the first
waiter to acquire the lock.

Reworking the code to enable a true lock handoff is more complex due to
the following facts:
1) The RWSEM_FLAG_HANDOFF bit is protected by the wait_lock and it
is too expensive to always take the wait_lock in the unlock path
to prevent racing.
Specifically, the issue is that we'd need to turn the
atomic_long_add_return_release() into an atomic_try_cmpxchg_release()
loop, like:

	tmp = atomic_long_read(&sem->count);
	do {
		if (tmp & (WAITERS|HANDOFF))
			return slow_unlock();
	} while (!atomic_long_try_cmpxchg_release(&sem->count, &tmp,
						  tmp - RWSEM_{READER_BIAS,WRITE_LOCKED}));

in order to not race with a concurrent setting of the HANDOFF bit,
right? And we don't like to turn unlock into a cmpxchg loop.

(OTOH we sorta do this for mutex, uncontended mutex has cmpxchg lock and
unlock, any fail and we go to the slow path -- I suppose the distinct
difference is that we sorta expect some contention on the read side)
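
For reference, the uncontended mutex fast paths are just single
try_cmpxchg operations on the owner word -- roughly the following,
simplified from the __mutex_trylock_fast()/__mutex_unlock_fast()
helpers in kernel/locking/mutex.c (debug and flag handling elided):

	static __always_inline bool __mutex_trylock_fast(struct mutex *lock)
	{
		unsigned long curr = (unsigned long)current;
		unsigned long zero = 0UL;

		/* 0 -> current: take the lock iff nobody owns it. */
		return atomic_long_try_cmpxchg_acquire(&lock->owner, &zero, curr);
	}

	static __always_inline bool __mutex_unlock_fast(struct mutex *lock)
	{
		unsigned long curr = (unsigned long)current;

		/* current -> 0: any failure (waiters, flags) goes slowpath. */
		return atomic_long_try_cmpxchg_release(&lock->owner, &curr, 0UL);
	}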
I see that your inclination is to do the handoff right at the unlock time.
It is certainly possible to do that, but it will be more complex in the case
of rwsem than mutex.
Still, it would make things ever so much simpler -- but I agree we'll
probably not get away with it on the performance side of things.

Right. In short:

Having HANDOFF set:
- implies WAITERS set
- disables all fastpaths (spinning or otherwise)
- disallows anybody except the first waiter to obtain the lock

Therefore, having the window between clearing owner and prodding first
waiter is 'harmless'.
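
In code terms, that means the reader fastpath only has to treat HANDOFF
as one more reason to bail to the slowpath -- a sketch of the existing
rwsem_read_trylock() logic (RWSEM_READ_FAILED_MASK already includes
RWSEM_FLAG_HANDOFF):

	static inline bool rwsem_read_trylock(struct rw_semaphore *sem, long *cntp)
	{
		*cntp = atomic_long_add_return_acquire(RWSEM_READER_BIAS, &sem->count);

		/* Any failed bit -- writer, waiters, HANDOFF -- goes slowpath. */
		if (!(*cntp & RWSEM_READ_FAILED_MASK)) {
			rwsem_set_reader_owned(sem);
			return true;
		}
		return false;	/* caller falls back to rwsem_down_read_slowpath() */
	}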
As said above, we need to confirm that the HANDOFF bit is set with the wait_lock
held. Now, the HANDOFF bit may be set at unlock time, or it may not be.

We can pass the count value fetched at unlock time down to rwsem_wake() to
confirm that the HANDOFF bit was set at unlock time. However, it is also possible
that the original waiter that set HANDOFF has bailed out, then a reader
acquires the lock and another waiter sets HANDOFF, all before the unlocker
acquires the wait_lock. In that case the rwsem is really reader-owned, so we
can't perform the handoff. That is why I also check whether there is an active
lock (mostly a read lock) at rwsem_wake(). However, that check can give a false
negative because an incoming reader may have just added a READER_BIAS which
is to be removed soon. That is the reason why I have a secondary handoff
check in the reader slowpath.
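
Put together, the handoff decision at wakeup time boils down to something
like the sketch below, run under the wait_lock (names follow the patch
loosely; HANDOFF_GRANTED stands for the repurposed waiter state discussed
further down, so treat this as illustrative only):

	/*
	 * Sketch only: handoff check in rwsem_wake(), with sem->wait_lock
	 * held and "count" being the count value fetched at unlock time.
	 */
	static void rwsem_handoff_check(struct rw_semaphore *sem, long count,
					struct rwsem_waiter *first)
	{
		/* No handoff was requested at unlock time. */
		if (!(count & RWSEM_FLAG_HANDOFF))
			return;

		/*
		 * The original HANDOFF setter may have bailed out and a
		 * reader may own the lock by now; hand off only if the
		 * lock is still free.  A transient READER_BIAS from an
		 * incoming reader can make this a false negative, which
		 * the secondary check in the reader slowpath catches.
		 */
		if (atomic_long_read(&sem->count) & RWSEM_LOCK_MASK)
			return;

		first->handoff_state = HANDOFF_GRANTED;
	}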

With true lock handoff, there is no need to do a NULL owner spinning
anymore as wakeup will be performed if handoff is possible. So it
is likely that the first waiter won't actually go to sleep even when
schedule() is called in this case.
Right, removing that NULL spinning was the whole purpose -- except I
seem to have forgotten why it was a problem :-)

OK, lemme go read the actual patch.

Hmmm... you made it a wee bit more complicated, instead of my 3rd clause
above, you added a whole intermediate GRANTED state. Why?

Since we fundamentally must deal with the release->wait_lock hole, why
do we need to do the whole rwsem_wake()->GRANTED->*_slowpath() dance?
Why can't we skip the whole rwsem_wake()->GRANTED part and only do
handoff in the slowpath?
First of all, the owner value for a reader-owned rwsem is mostly advisory,
as it holds only one of the possible owners. So it may be a bit risky to use
it as an indication that a handoff has happened, as it may be screwed up in
some rare cases. That is why I use the repurposed handoff_state value in the
waiter structure. Also, reading this value is less costly than reading the
rwsem cacheline, which can be heavily contended.
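
Concretely, the patch repurposes the old handoff_set flag in the waiter
structure into a small state machine, along these lines (handoff_state is
from this series; the exact enum values here are illustrative):

	enum rwsem_handoff_state {
		HANDOFF_NONE = 0,
		HANDOFF_REQUESTED,	/* this waiter set the HANDOFF bit */
		HANDOFF_GRANTED,	/* lock transferred directly to this waiter */
	};

	struct rwsem_waiter {
		struct list_head list;
		struct task_struct *task;
		enum rwsem_waiter_type type;
		unsigned long timeout;
		enum rwsem_handoff_state handoff_state;
	};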

I will update the patch description to highlight the points that I discussed
in this email.
Maybe I'm being dense, but I'm not seeing it. If we make HANDOFF block
all the fastpaths, all the spinning, all the stealing, everything; then
all that is left is the slowpath that is holding wait_lock.

Then in both slowpaths, ensure only the first waiter can go on and we're
done.

What am I missing? Why does it need to be so complicated?

That is, afaict something like the below would actually work, no? Yes,
simply deleting that spinning in write_slowpath isn't ideal, but I
suspect we can do something to rwsem_try_write_lock() to make up for
that if we think about it.
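
Something along these lines in rwsem_try_write_lock(), which already runs
with the wait_lock held, would be the natural place for it -- a sketch, with
the first-waiter test being the one addition (helper and flag names assumed
to match mainline):

	/*
	 * Sketch: write-lock attempt under sem->wait_lock.  With HANDOFF
	 * set, only the first waiter is allowed to take the lock.
	 */
	count = atomic_long_read(&sem->count);
	do {
		if ((count & RWSEM_FLAG_HANDOFF) &&
		    waiter != rwsem_first_waiter(sem))
			return false;	/* reserved for the first waiter */

		if (count & RWSEM_LOCK_MASK)
			return false;	/* still locked, keep waiting */

		new = (count & ~RWSEM_FLAG_HANDOFF) | RWSEM_WRITER_LOCKED;
	} while (!atomic_long_try_cmpxchg_acquire(&sem->count, &count, new));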

Again, please explain, explicitly and in small steps, why you think you
need all that complexity.

--- a/kernel/locking/rwsem.c
+++ b/kernel/locking/rwsem.c
@@ -40,7 +40,7 @@
  *
  * When the rwsem is reader-owned and a spinning writer has timed out,
  * the nonspinnable bit will be set to disable optimistic spinning.
-
+ *
  * When a writer acquires a rwsem, it puts its task_struct pointer
  * into the owner field. It is cleared after an unlock.
  *
@@ -430,6 +430,10 @@ static void rwsem_mark_wake(struct rw_se
 			 * Mark writer at the front of the queue for wakeup.
 			 * Until the task is actually later awoken later by
 			 * the caller, other writers are able to steal it.
+			 *
+			 * *Unless* HANDOFF is set, in which case only the
+			 * first waiter is allowed to take it.
+			 *
 			 * Readers, on the other hand, will block as they
 			 * will notice the queued writer.
 			 */
@@ -463,6 +467,9 @@ static void rwsem_mark_wake(struct rw_se
 			 * force the issue.
 			 */
 			if (time_after(jiffies, waiter->timeout)) {
+				/*
+				 * Setting HANDOFF blocks fastpaths and stealing.
+				 */
 				if (!(oldcount & RWSEM_FLAG_HANDOFF)) {
 					adjustment -= RWSEM_FLAG_HANDOFF;
 					lockevent_inc(rwsem_rlock_handoff);
@@ -471,6 +478,13 @@ static void rwsem_mark_wake(struct rw_se
 			}
 
 			atomic_long_add(-adjustment, &sem->count);
+
+			if (waiter->handoff_set) {
+				/*
+				 * With HANDOFF set we must terminate all spinning.
+				 */
+				rwsem_set_nonspinnable(sem);
+			}
 			return;
 		}
 
 		/*
@@ -844,7 +858,6 @@ static bool rwsem_optimistic_spin(struct
 		 * Try to acquire the lock
 		 */
 		taken = rwsem_try_write_lock_unqueued(sem);
-
 		if (taken)
 			break;
 
@@ -1159,22 +1172,6 @@ rwsem_down_write_slowpath(struct rw_sema
 		if (signal_pending_state(state, current))
 			goto out_nolock;
 
-		/*
-		 * After setting the handoff bit and failing to acquire
-		 * the lock, attempt to spin on owner to accelerate lock
-		 * transfer. If the previous owner is a on-cpu writer and it
-		 * has just released the lock, OWNER_NULL will be returned.
-		 * In this case, we attempt to acquire the lock again
-		 * without sleeping.
-		 */
-		if (waiter.handoff_set) {
-			enum owner_state owner_state;
-
-			owner_state = rwsem_spin_on_owner(sem);
-			if (owner_state == OWNER_NULL)
-				goto trylock_again;
-		}
-
 		schedule_preempt_disabled();
 		lockevent_inc(rwsem_sleep_writer);
 		set_current_state(state);

Thanks for the comments and suggested changes. I will adopt some of your suggestions to simplify the patchset. Will post a new version once I finish my testing.

Cheers,
Longman