Re: [PATCH v7 4/4] locking/rwsem: Enable direct rwsem lock handoff

From: Waiman Long
Date: Mon Feb 13 2023 - 21:04:28 EST



On 2/13/23 16:52, Peter Zijlstra wrote:
On Mon, Feb 13, 2023 at 12:14:59PM -0500, Waiman Long wrote:

I am once again confused...

*WHY* are you changing the writer wake-up path? The comments added here
don't clarify anything.

If we set handoff, we terminate/disallow the spinning/stealing. The
direct consequence is that the slowpath/wait-list becomes the only way
forward.
Yes, that is true.
Since we don't take wait_lock on up, we fundamentally have a race
condition. But *WHY* do you insist on handling that in rwsem_wake()?
Delaying all that until rwsem_try_write_lock()? Doing so would render
pretty much all of the above pointless, no?
There is an advantage in doing the handover earlier, if possible. A reader
that comes in between can spoils the takeover of the rwsem in
How ?!? since all the stealing and spinning is out, the wait-list reigns
supreme. A new reader will queue.

Are you worried about the spurious elevation of ->count due to that
uncondition increment on down_read() ? This is always going to be the
case.

Yes, that is what I am talking about.



rwsem_try_write_lock() and cause it to sleep again. Since we will have to
take the wait lock anyway in rwsem_wake(), there isn't much additional cost
to do some additional check.
There is a complexity cost -- and so far I've not seen a single line of
justification for the added complexity.

Note that the kernel test robot had detected a 19.3% improvement of
will-it-scale.per_thread_ops [1] due to this commit. That indicates this
commit is good to have. I am planning to update the commit log to include
that information as well as additional reasoning as discussed here.
Seen that; but who's saying a simpler implementation will not also have
those gains. And if not, then we have a clear separation into
functionality and optimization and justifications for it.

But now, we have neither. I'm not saying the patch is wrong -- I'm
saying it is needlessly complicated without justification.

All this stuff is hard enough -- we should really try to keep is as
simple as possible, having two distinct ways to wake up a writer is not
'as simple as possible'.

OK, a simpler implementation will be to also check for handoff readiness in rwsem_wake() and set a flag in waiter structure. When the writer wakes up and calls  rwsem_try_write_lock(), it can ignore any spurious READER_BIAS and take over the lock directly if the flag is set. Will you be OK with that?

I believe the one of the reasons for the performance improvement reported by the test robot is that the check for handoff readiness in both rwsem_wake() and rwsem_try_write_lock() will increase the chance that a handoff will be successful in the first attempt. Another possible reason is that writer may not need to take the wait_lock anymore if all the handoff work is done by the unlocker at rwsem_wake(). That will be lost if we do a simpler solution as outlined above.


After all, rwsem_mark_wake() already wakes the writer if it is first,
no? Why invent yet another special way to wake up the writer.
As I said before, waking up the writer does not mean it can always get the
rwsem on the first rwsem_try_write_lock(). Doing early handoff in
rwsem_wake() can remove that ambiguity.
But what if it is rwsem_down_read_slowpath() that drops ->count to 0 and
does rwsem_mark_wake() and misses your fancy new path?

As said above, the current patch allows handoff to be done at both rwsem_wake() and rwsem_try_write_lock(). I haven't removed the code to do handoff in rwsem_try_write_lock().



That is, I'm thinking there's an argument to be had that rwsem_wake() is
fundamentally the wrong place to do anything.

OTOH, you are right in that rwsem_mark_wake() is in a special position;
it *KNOWS* the reader count has hit 0 and can ignore future spurious
increments because by knowing it was 0 it knows they *WILL* fail and
queue (provided WAITERS||HANDOFF) -- but I've not seen you
articulate this point.
My bad, I sometimes miss stating assumptions and rationale that can properly justify the patch. I usually need more back and fro like what we are doing now to fully write out all the missing pieces. So I really appreciate your time in reviewing the patch and your critiques :-)

(note how rwsem_cond_wake_waiter() relies on exactly this to select
WAKE_ANY)

You are also right in that having these different means of waking
readers and writers is 'confusing'.

But you failed to take things to their natural conclusion -- change the
way we do writer wakeups, all writer wakeups.

So change rwsem_mark_wake()'s writer wakeup and
rwsem_down_write_slowpath() to always so that waiter->task thing that
the readers already do.

It means putting all the logic for waking writers into
rwsem_mark_wake(), but then we can make full use of this 'we hit zero
future readers are temporary noise'. Althought I suspect you might need
to pass count into it, so we can observe the flags at the time 0 was
observed.

Is all that making sense? -- it has been a long day :-)

I guess it is possible to do that, but it may reduce the performance improvement due to optimistic spinning and it will become more like the old way of strictly following the wait queue order in granting lock. I need to run some tests to verify the performance impact.



Also; and I asked this last time around; why do we care about the
handoff to writer *at*all* ? It is the readers that set HANDOFF.
HANDOFF can happen for both readers and writers. Handoff to writer is
actually more important than to readers.
Hmm, clearly I missed something. This is rwsem_try_write_lock() setting
it, right?

Yes, handoff can be set and granted in rwsem_try_write_lock().

Cheers,
Longman