Re: [PATCH v7 4/4] locking/rwsem: Enable direct rwsem lock handoff

From: Peter Zijlstra
Date: Mon Feb 13 2023 - 16:52:56 EST


On Mon, Feb 13, 2023 at 12:14:59PM -0500, Waiman Long wrote:

> > I am once again confused...
> >
> > *WHY* are you changing the writer wake-up path? The comments added here
> > don't clarify anything.
> >
> > If we set handoff, we terminate/disallow the spinning/stealing. The
> > direct consequence is that the slowpath/wait-list becomes the only way
> > forward.
> Yes, that is true.
> >
> > Since we don't take wait_lock on up, we fundamentally have a race
> > condition. But *WHY* do you insist on handling that in rwsem_wake()?
> > Delaying all that until rwsem_try_write_lock()? Doing so would render
> > pretty much all of the above pointless, no?
>
> There is an advantage in doing the handover earlier, if possible. A reader
> that comes in between can spoils the takeover of the rwsem in

How ?!? since all the stealing and spinning is out, the wait-list reigns
supreme. A new reader will queue.

Are you worried about the spurious elevation of ->count due to that
uncondition increment on down_read() ? This is always going to be the
case.

> rwsem_try_write_lock() and cause it to sleep again. Since we will have to
> take the wait lock anyway in rwsem_wake(), there isn't much additional cost
> to do some additional check.

There is a complexity cost -- and so far I've not seen a single line of
justification for the added complexity.

> Note that the kernel test robot had detected a 19.3% improvement of
> will-it-scale.per_thread_ops [1] due to this commit. That indicates this
> commit is good to have. I am planning to update the commit log to include
> that information as well as additional reasoning as discussed here.

Seen that; but who's saying a simpler implementation will not also have
those gains. And if not, then we have a clear separation into
functionality and optimization and justifications for it.

But now, we have neither. I'm not saying the patch is wrong -- I'm
saying it is needlessly complicated without justification.

All this stuff is hard enough -- we should really try to keep is as
simple as possible, having two distinct ways to wake up a writer is not
'as simple as possible'.

> > After all, rwsem_mark_wake() already wakes the writer if it is first,
> > no? Why invent yet another special way to wake up the writer.
> As I said before, waking up the writer does not mean it can always get the
> rwsem on the first rwsem_try_write_lock(). Doing early handoff in
> rwsem_wake() can remove that ambiguity.

But what if it is rwsem_down_read_slowpath() that drops ->count to 0 and
does rwsem_mark_wake() and misses your fancy new path?

That is, I'm thinking there's an argument to be had that rwsem_wake() is
fundamentally the wrong place to do anything.

OTOH, you are right in that rwsem_mark_wake() is in a special position;
it *KNOWS* the reader count has hit 0 and can ignore future spurious
increments because by knowing it was 0 it knows they *WILL* fail and
queue (provided WAITERS||HANDOFF) -- but I've not seen you
articulate this point.

(note how rwsem_cond_wake_waiter() relies on exactly this to select
WAKE_ANY)

You are also right in that having these different means of waking
readers and writers is 'confusing'.

But you failed to take things to their natural conclusion -- change the
way we do writer wakeups, all writer wakeups.

So change rwsem_mark_wake()'s writer wakeup and
rwsem_down_write_slowpath() to always so that waiter->task thing that
the readers already do.

It means putting all the logic for waking writers into
rwsem_mark_wake(), but then we can make full use of this 'we hit zero
future readers are temporary noise'. Althought I suspect you might need
to pass count into it, so we can observe the flags at the time 0 was
observed.

Is all that making sense? -- it has been a long day :-)

> > Also; and I asked this last time around; why do we care about the
> > handoff to writer *at*all* ? It is the readers that set HANDOFF.
>
> HANDOFF can happen for both readers and writers. Handoff to writer is
> actually more important than to readers.

Hmm, clearly I missed something. This is rwsem_try_write_lock() setting
it, right?