Re: [PATCH v6 5/6] locking/rwsem: Enable direct rwsem lock handoff

From: Waiman Long
Date: Mon Jan 23 2023 - 12:33:28 EST


On 1/23/23 09:59, Peter Zijlstra wrote:
> On Thu, Nov 17, 2022 at 09:20:15PM -0500, Waiman Long wrote:
> > The lock handoff provided in rwsem isn't a true handoff like that in
> > the mutex. Instead, it is more like a quiescent state where optimistic
> > spinning and lock stealing are disabled to make it easier for the first
> > waiter to acquire the lock.

> > Reworking the code to enable a true lock handoff is more complex due to
> > the following facts:
> > 1) The RWSEM_FLAG_HANDOFF bit is protected by the wait_lock and it
> >    is too expensive to always take the wait_lock in the unlock path
> >    to prevent racing.
> Specifically, the issue is that we'd need to turn the
> atomic_long_add_return_release() into an atomic_try_cmpxchg_release()
> loop, like:
>
>	tmp = atomic_long_read(&sem->count);
>	do {
>		if (tmp & (WAITERS|HANDOFF))
>			return slow_unlock();
>	} while (!atomic_long_try_cmpxchg_release(&sem->count, &tmp,
>			tmp - RWSEM_{READER_BIAS,WRITER_LOCKED}));
>
> in order to not race with a concurrent setting of the HANDOFF bit,
> right? And we don't like to turn unlock into a cmpxchg loop.

> (OTOH we sorta do this for mutex, uncontended mutex has cmpxchg lock and
> unlock, any fail and we go to the slow path -- I suppose the distinct
> difference is that we sorta expect some contention on the read side)

I see that your inclination is to do the handoff right at unlock time. It
is certainly possible to do that, but it will be more complex for rwsem
than for mutex.

First of all, for mutex, the owner is the lock word. You do one successful
cmpxchg and it is all done. rwsem isn't like that. Its owner field is
separate from the actual lock word (count), so setting the right lock
value and the owner cannot be done atomically. That is why I am using the
wait_lock to synchronize between the unlocker and the first waiter that
set the handoff bit. Since rwsem_wake() will take the wait_lock anyway, I
decided to do the handoff at that time.

Another consideration is to minimize the unlock overhead: doing the
handoff at unlock time requires an extra read of the rwsem count.


> > 2) The reader lock fast path may add a RWSEM_READER_BIAS at the wrong
> >    time to prevent a proper lock handoff from a reader owned rwsem.
>
> This would be much the same, right? We'd have to turn
> rwsem_read_trylock() into a cmpxchg-loop and we don't like that.
> Therefore we do that speculative add and fix up later.

> Now, I'm not entirely sure what issue you allude to here; is the problem
> that you can't quite tell when the last reader is gone?
> > A lock handoff can only be initiated when the following conditions are
> > true:
> > 1) The RWSEM_FLAG_HANDOFF bit is set.
>
> d'uh ;-)

> > 2) The task to do the handoff doesn't see any other active lock
> >    excluding the lock that it might have held.
>
> 2) here is the 2) above, right?

> > The new handoff mechanism performs handoff in rwsem_wakeup() to minimize
> > overhead. The rwsem count will be known at that point to determine if
> > handoff should be done. However, there is a small time gap between when
> > the rwsem becomes free and when the wait_lock is taken
>
> Right, that's between atomic_long_fetch_add_release() and calling the
> slow path because WAITERS bit is set.
Yes, there is no extra overhead unless the waiter bit is set.

> > where a reader can come in and add a RWSEM_READER_BIAS to the count or
>
> Both 2s above.

> > the current first waiter can take the rwsem and clear
> > RWSEM_FLAG_HANDOFF in the interim.
>
> The actual intended action.

> > That will fail the handoff operation.
>
> I would not list that latter as a failure, it's exactly what we want to
> happen, no?

Yes, that is the intended action.

I added this because even if the HANDOFF bit is observed to be set at
unlock time, it does not guarantee that a handoff can be successfully
done, as the first waiter can be interrupted or killed in the interim.
The HANDOFF bit has to be confirmed to be set while holding the
wait_lock before we can be sure that a handoff is possible.

> > To handle the former case, a secondary handoff will also be done in
> > the rwsem_down_read_slowpath() to catch it.
>
> Right. In short:
>
> Having HANDOFF set:
>  - implies WAITERS set
>  - disables all fastpaths (spinning or otherwise)
>  - dis-allows anybody except first waiter to obtain lock
>
> Therefore, having the window between clearing owner and prodding first
> waiter is 'harmless'.

As said above, we need to confirm that the HANDOFF bit is set with the
wait_lock held. Now, the HANDOFF bit may or may not be set at unlock
time.

We can pass the count value fetched at unlock time down to rwsem_wake()
to confirm that the HANDOFF bit was set at unlock time. However, it is
also possible that the original waiter that set HANDOFF has bailed out,
then a reader acquires the lock and another waiter sets HANDOFF before
the unlocker acquires the wait_lock. The rwsem is really reader-owned in
this case, so we can't perform a handoff. That is why I also check for
an active lock (mostly a read lock) in rwsem_wake(). However, that check
can give a false negative because an incoming reader may have just added
a READER_BIAS which will be removed soon. That is the reason why I have
a secondary handoff check in the reader slowpath.


> > With true lock handoff, there is no need to do a NULL owner spinning
> > anymore as wakeup will be performed if handoff is possible. So it
> > is likely that the first waiter won't actually go to sleep even when
> > schedule() is called in this case.
>
> Right, removing that NULL spinning was the whole purpose -- except I
> seem to have forgotten why it was a problem :-)
>
> OK, lemme go read the actual patch.

> Hmmm... you made it a wee bit more complicated, instead of my 3rd clause
> above, you added a whole intermediate GRANTED state. Why?
>
> Since we fundamentally must deal with the release->wait_lock hole, why
> do we need to do the whole rwsem_wake()->GRANTED->*_slowpath() dance?
> Why can't we skip the whole rwsem_wake()->GRANTED part and only do
> handoff in the slowpath?

First of all, the owner value of a reader-owned rwsem is mostly
advisory, as it holds just one of the possible owners. So it may be a
bit risky to use it as an indication that a handoff has happened, as it
may be screwed up in some rare cases. That is why I use the repurposed
handoff_state value in the waiter structure instead. Also, reading this
value is less costly than reading the rwsem cacheline, which can be
heavily contended.

I will update the patch description to highlight the points that I discussed in this email.

Cheers,
Longman