Re: [PATCH 11/12] rwsem: wake all readers when first waiter is a reader

From: Michel Lespinasse
Date: Tue Mar 19 2013 - 19:48:36 EST


On Mon, Mar 18, 2013 at 6:17 PM, Dave Chinner <david@xxxxxxxxxxxxx> wrote:
> On Wed, Mar 13, 2013 at 10:00:51PM -0400, Peter Hurley wrote:
>> On Wed, 2013-03-13 at 14:23 +1100, Dave Chinner wrote:
>> > We don't care about the ordering between multiple concurrent
>> > metadata modifications - what matters is whether the ongoing data IO
>> > around them is ordered correctly.
>>
>> Dave,
>>
>> The point that Michel is making is that there never was any ordering
>> guarantee by rwsem. It's an illusion.
>
> Weasel words.

Whoaaa, calm down.

You initially made one false statement (that the change meant a stream
of readers would starve a writer forever) and one imprecise statement
(that rwsem used to guarantee that readers don't skip ahead of writers).
The second may be true in practice for your use case, because the
latencies involved are very large compared to scheduling latencies, but
that is an important qualification that needs to be stated. Those two
statements confused me enough that I initially couldn't tell what your
actual concern was, so I pointed out the source of my confusion and
asked you to clarify. It seems unfair to characterize that as "weasel
words" - I'm not trying to be a smartass here, only to understand your
concern.

>> The reason is simple: to even get to the lock the cpu has to be
>> sleep-able. So every submission that you believe is ordered is, by
>> its very nature, __not ordered__, even when used by kernel code.
>>
>> Why? Because any thread on its way to claim the lock (reader or writer)
>> could be pre-empted for some other task, thus delaying the submission of
>> whatever i/o you believed to be ordered.
>
> You think I don't know this? You're arguing that fine-grained, low-level
> behaviour between tasks is unpredictable. I get that. I understand
> that. But I'm not arguing about fine-grained, low-level, microsecond
> semantics of the locking order....
>
> What you (and Michel) appear to be failing to see is what happens
> on a macro level when you have read locks being held for periods
> measured in *seconds* (e.g. direct IO gets queued behind a few
> thousand other IOs in the elevator waiting for a request slot),
> and the subsequent effect of inserting an operation that requires a
> write lock into that IO stream.
>
> IOWs, it simply doesn't matter if there's a micro-level race between
> the write lock and a couple of the readers. That's the level you
> guys are arguing at, but it simply does not matter in the cases I'm
> describing. I'm talking about high-level serialisation behaviours
> that might take *seconds* to play out, and the ordering behaviours
> observed at that scale.
>
> That is, I don't care if a couple of threads out of a few thousand
> race with the write lock over a few tens to hundreds of microseconds,
> but I most definitely care if a few thousand IOs issued seconds
> after the write lock is queued jump over the write lock. That is a
> gross behavioural change at the macro-level.....

Understood. I accepted your concern and made sure my v2 proposal
doesn't do such macro level reordering.
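
To make the intended behaviour concrete, here is a rough sketch of the
wakeup policy (illustrative only - toy data structures and a made-up
function name, not the actual rwsem code and not the v2 patch itself):
when the lock becomes available and the waiter at the head of the queue
is a reader, wake the consecutive readers at the head of the wait list,
but stop at the first queued writer, so readers that arrived after that
writer cannot jump over it.

#include <linux/list.h>
#include <linux/sched.h>
#include <linux/spinlock.h>

/* Toy structures for illustration; the real rwsem waiter and semaphore
 * types look different. */
struct toy_waiter {
	struct list_head list;
	struct task_struct *task;
	bool is_writer;
};

struct toy_rwsem {
	spinlock_t wait_lock;
	struct list_head wait_list;	/* FIFO: arrival order is preserved */
};

/* Called with wait_lock held, when the lock has just become available
 * and the waiter at the head of the list is a reader. */
static void toy_wake_head_readers(struct toy_rwsem *sem)
{
	struct toy_waiter *w, *tmp;

	list_for_each_entry_safe(w, tmp, &sem->wait_list, list) {
		if (w->is_writer)
			break;	/* never wake readers queued behind a writer */
		list_del(&w->list);
		wake_up_process(w->task);
	}
}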

>> So just to reiterate: there is no 'queue' and no 'barrier'. The
>> guarantees that rwsem makes are;
>> 1. Multiple readers can own the lock.
>> 2. Only a single writer can own the lock.
>> 3. Readers will not starve writers.
>
> You've conveniently ignored the fact that the current implementation
> also provides following guarantee:
>
> 4. new readers will block behind existing writers

In your use case, with large enough queue latencies, yes.

Please don't make it sound like this applies in every use case - it
has never applied for short (<ms) queue latencies, and you might
confuse people by making such unqualified statements.
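
To illustrate the short-latency case with a userspace analogy (a POSIX
rwlock, not kernel rwsem - purely illustrative, and the outcome is
itself racy): a writer that conceptually starts first but is delayed
before it reaches the lock will usually lose to a reader that started
later, because any ordering only exists once the writer has actually
queued on the lock.

#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static pthread_rwlock_t lock = PTHREAD_RWLOCK_INITIALIZER;

static void *writer(void *arg)
{
	(void)arg;
	usleep(1000);			/* stand-in for "preempted on the way to the lock" */
	pthread_rwlock_wrlock(&lock);
	puts("writer got the lock");
	pthread_rwlock_unlock(&lock);
	return NULL;
}

static void *reader(void *arg)
{
	(void)arg;
	pthread_rwlock_rdlock(&lock);	/* the "later" reader reaches the lock first... */
	puts("reader got the lock");
	usleep(5000);			/* ...and holds it while the writer arrives */
	pthread_rwlock_unlock(&lock);
	return NULL;
}

int main(void)
{
	pthread_t w, r;

	pthread_create(&w, NULL, writer, NULL);	/* writer is started first... */
	pthread_create(&r, NULL, reader, NULL);	/* ...but usually loses the race to the lock */
	pthread_join(w, NULL);
	pthread_join(r, NULL);
	return 0;
}

Build with gcc -pthread; the reader typically prints first even though
the writer thread was created first.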

--
Michel "Walken" Lespinasse
A program is never fully debugged until the last user dies.