Re: [PATCH v2] locking/rwbase: Prevent indefinite writer starvation

From: Sebastian Andrzej Siewior
Date: Tue Jan 17 2023 - 09:25:07 EST


On 2023-01-17 08:38:17 [+0000], Mel Gorman wrote:
> rw_semaphore and rwlock are explicitly unfair to writers in the presence
> of readers by design with a PREEMPT_RT configuration. Commit 943f0edb754f
> ("locking/rt: Add base code for RT rw_semaphore and rwlock") notes:
>
> The implementation is writer unfair, as it is not feasible to do
> priority inheritance on multiple readers, but experience has shown
> that real-time workloads are not the typical workloads which are
> sensitive to writer starvation.
>
> While atypical, it's also trivial to block writers with PREEMPT_RT
> indefinitely without ever making forward progress. Since LTP-20220121,
> the dio_truncate test case went from having 1 reader to having 16 readers,
> and the number of readers is sufficient to prevent down_write() from ever
> succeeding while readers exist. Eventually the test is killed after 30
> minutes as a failure.
>
> dio_truncate is not a realtime application but indefinite writer starvation

If so, then PI boosting would not help even if we had it ;)

> is undesirable. The test case has one writer appending and truncating files
> A and B while multiple readers read file A. The readers and writer contend
> for one file's inode lock, and the write acquisition never succeeds because
> the readers keep reading until the writer is done, which never happens.

So this tests the rwsem/rwlock implementation to ensure that it is not
writer unfair.

> This patch records a timestamp when the first writer is blocked if no
> deadline or realtime task has recently acquired the lock for read. If
> dl/rt tasks are involved, then reader bias is preserved. For other tasks,
DL/RT. Would it work to use capital letters when referring to the
scheduling classes?

> reader bias is allowed until the first writer has been blocked for a minimum
> of 4ms or 1 tick. The cutoff time is arbitrary on the assumption that a
> normal application contending for 4ms also does not need PREEMPT_RT. On
> a test machine, the test completed in 88 seconds.

I would go for one second just because it _usually_ does not matter,
since none of the important locks rely on that (as stated in the commit
message). But then again, why not use the 4ms/1 tick you suggest: this
is, after all, what the non-PREEMPT_RT implementation uses to ensure
that the writer is not stalled indefinitely. The RWLOCK implementation
is already writer unfair.
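
For reference, this is the cutoff the generic rwsem code uses for its
handoff logic (from kernel/locking/rwsem.c, quoted from memory):

	/*
	 * The typical HZ value is either 250 or 1000. So set the minimum
	 * waiting time to at least 4ms or 1 jiffy (if it is higher than
	 * 4ms) in the wait queue before initiating the handoff protocol.
	 */
	#define RWSEM_WAIT_TIMEOUT	DIV_ROUND_UP(HZ, 250)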

Side note: If the test case gets updated to RT readers which acquire the
lock (the whole time) then they will block the writer again :)

> Signed-off-by: Mel Gorman <mgorman@xxxxxxxxxxxxxxxxxxx>
> ---
> include/linux/rwbase_rt.h | 3 ++
> kernel/locking/rwbase_rt.c | 84 +++++++++++++++++++++++++++++++++++++++++++---
> 2 files changed, 82 insertions(+), 5 deletions(-)
>
> diff --git a/include/linux/rwbase_rt.h b/include/linux/rwbase_rt.h
> index 1d264dd08625..05c4dc74b8bd 100644
> --- a/include/linux/rwbase_rt.h
> +++ b/include/linux/rwbase_rt.h
> @@ -10,12 +10,14 @@
>
> struct rwbase_rt {
> atomic_t readers;
> + unsigned long waiter_blocked;
> struct rt_mutex_base rtmutex;
> };
>
> #define __RWBASE_INITIALIZER(name) \
> { \
> .readers = ATOMIC_INIT(READER_BIAS), \
> + .waiter_blocked = 0, \
> .rtmutex = __RT_MUTEX_BASE_INITIALIZER(name.rtmutex), \
> }
>
> @@ -23,6 +25,7 @@ struct rwbase_rt {
> do { \
> rt_mutex_base_init(&(rwbase)->rtmutex); \
> atomic_set(&(rwbase)->readers, READER_BIAS); \
> + (rwbase)->waiter_blocked = 0; \
> } while (0)
>
>
> diff --git a/kernel/locking/rwbase_rt.c b/kernel/locking/rwbase_rt.c
> index c201aadb9301..db2f6accf49f 100644
> --- a/kernel/locking/rwbase_rt.c
> +++ b/kernel/locking/rwbase_rt.c
> @@ -39,7 +39,11 @@
> * major surgery for a very dubious value.
> *
> * The risk of writer starvation is there, but the pathological use cases
> - * which trigger it are not necessarily the typical RT workloads.
> + * which trigger it are not necessarily the typical RT workloads. The worst
> + * case of indefinite starvation of a writer will force readers into the
> + * slow path if a writer is blocked for more than RW_CONTENTION_THRESHOLD
> + * jiffies unless dl/rt tasks have taken a read lock since the last write
DL/RT please.

> + * unlock.
> *
> * Fast-path orderings:
> * The lock/unlock of readers can run in fast paths: lock and unlock are only
> @@ -65,6 +69,61 @@ static __always_inline int rwbase_read_trylock(struct rwbase_rt *rwb)
> return 0;
> }
>
> +/*
> + * Allow reader bias with a pending writer for a minimum of 4ms or 1 tick.

* This matches RWSEM_WAIT_TIMEOUT for the generic RWSEM
* implementation.

> + * The granularity is not exact as the lowest bit in rwbase_rt->waiter_blocked
> + * is used to detect recent rt/dl tasks taking a read lock.
> + */
> +#define RW_CONTENTION_THRESHOLD (HZ/250+1)
DIV_ROUND_UP(HZ, 250)
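
For comparison, the two expressions differ at common HZ values:

	HZ=100:  HZ/250+1 == 1	DIV_ROUND_UP(HZ, 250) == 1
	HZ=250:  HZ/250+1 == 2	DIV_ROUND_UP(HZ, 250) == 1
	HZ=1000: HZ/250+1 == 5	DIV_ROUND_UP(HZ, 250) == 4

so DIV_ROUND_UP() matches the 4ms/1 tick from the changelog exactly.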

> +static void __sched update_dlrt_reader(struct rwbase_rt *rwb)
> +{
> + /* No update required if dl/rt tasks already identified. */
> + if (rwb->waiter_blocked & 1)
> + return;
> +
> + /*
> + * Record a dl/rt task acquiring the lock for read. This may result
DL/RT
> + * in indefinite writer starvation but dl/rt tasks should avoid such
> + * behaviour.
> + */
> + if (dl_task(current) || rt_task(current)) {

There is also task_is_realtime(). But using only rt_task() should work
since it also covers dl_task().
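
A sketch of why, paraphrasing the prio checks from
include/linux/sched/rt.h and include/linux/sched/deadline.h: DL tasks
run at a negative prio below MAX_DL_PRIO == 0, which is also below
MAX_RT_PRIO, so rt_task() is true for them as well:

	static inline int dl_prio(int prio)
	{
		return unlikely(prio < MAX_DL_PRIO);	/* MAX_DL_PRIO == 0 */
	}

	static inline int rt_prio(int prio)
	{
		return unlikely(prio < MAX_RT_PRIO);	/* DL prios qualify too */
	}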

> + struct rt_mutex_base *rtm = &rwb->rtmutex;
> + unsigned long flags;
> +
> + raw_spin_lock_irqsave(&rtm->wait_lock, flags);
> + rwb->waiter_blocked |= 1;
> + raw_spin_unlock_irqrestore(&rtm->wait_lock, flags);
> + }
> +}
> +
> +/* rtmutex->wait_lock must be held. */
> +static void __sched set_writer_blocked(struct rwbase_rt *rwb)
> +{
> + /*
> + * Lowest bit preserved to identify recent rt/dl tasks acquiring
> + * the lock for read so guarantee at least one tick delay.
> + */
> + rwb->waiter_blocked |= (jiffies + 2) & ~1UL;

I'm unsure what |= means in terms of multiple writers. It seems to
extend the wait period, and the second writer is left without one after
the first writer leaves.
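
For illustration (made-up jiffies values), with the first writer
blocking at jiffies=1000 and a second one at jiffies=1010:

	waiter_blocked  = (1002 & ~1UL);	/* == 1002		*/
	waiter_blocked |= (1012 & ~1UL);	/* 1002 | 1012 == 1022	*/

The recorded stamp moves forward to 1022, later than either writer's
arrival, and once the first writer wins and resets waiter_blocked to 0,
the second writer starts from scratch.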

> +}
> +
> +static bool __sched rwbase_allow_reader_bias(struct rwbase_rt *rwb)
> +{
> + /*
> + * Allow reader bias if a dl or rt task took the lock for read
> + * since the last write unlock. Such tasks should be designed
> + * to avoid heavy writer contention or indefinite starvation.
> + */
> + if (rwb->waiter_blocked & 1)
> + return true;
> +
> + /*
> + * Allow reader bias unless a writer has been blocked for more
> + * than RW_CONTENTION_THRESHOLD jiffies.
> + */
> + return jiffies - rwb->waiter_blocked < RW_CONTENTION_THRESHOLD;

If you set

	rwb->waiter_blocked = jiffies + RW_CONTENTION_THRESHOLD;

then you could use

	time_after(jiffies, rwb->waiter_blocked)

and we could name it timeout. So the first writer sets it, and my guess
would be that each RT reader ignores this delay while every non-RT
reader acquires the lock as long as the timeout has not occurred. After
that they back off and wait for a writer to acquire the lock.
I don't know what we do with a possible second writer, but I guess the
first writer should reset the timeout on unlock for the next writer. So
we again have readers followed by a writer.
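
A minimal sketch of that scheme (the waiter_timeout name, the rt_task()
check, and the zero-as-unarmed sentinel are my assumptions; jiffies
wraparound of the sentinel is ignored for brevity):

	/* rtmutex->wait_lock must be held. */
	static void __sched set_writer_blocked(struct rwbase_rt *rwb)
	{
		/* Only the first blocked writer arms the timeout. */
		if (!rwb->waiter_timeout)
			rwb->waiter_timeout = jiffies + RW_CONTENTION_THRESHOLD;
	}

	static bool __sched rwbase_allow_reader_bias(struct rwbase_rt *rwb)
	{
		/* DL/RT readers keep the reader bias unconditionally. */
		if (rt_task(current))
			return true;

		/* Everyone else backs off once the timeout has passed. */
		return !time_after(jiffies, rwb->waiter_timeout);
	}

rwbase_write_unlock() would then reset waiter_timeout to 0 so the next
blocked writer starts a fresh period.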

> +}
> +
> static int __sched __rwbase_read_lock(struct rwbase_rt *rwb,
> unsigned int state)
> {
> @@ -74,9 +133,11 @@ static int __sched __rwbase_read_lock(struct rwbase_rt *rwb,
> raw_spin_lock_irq(&rtm->wait_lock);
> /*
> * Allow readers, as long as the writer has not completely
> - * acquired the semaphore for write.
> + * acquired the semaphore for write and reader bias is still
> + * allowed.
> */
> - if (atomic_read(&rwb->readers) != WRITER_BIAS) {
> + if (atomic_read(&rwb->readers) != WRITER_BIAS &&
> + rwbase_allow_reader_bias(rwb)) {
> atomic_inc(&rwb->readers);
> raw_spin_unlock_irq(&rtm->wait_lock);
> return 0;
> @@ -140,10 +201,18 @@ static int __sched __rwbase_read_lock(struct rwbase_rt *rwb,
> static __always_inline int rwbase_read_lock(struct rwbase_rt *rwb,
> unsigned int state)
> {
> + int ret;
> +
> if (rwbase_read_trylock(rwb))
> - return 0;
> + ret = 0;
> + else
> + ret = __rwbase_read_lock(rwb, state);
> +
> + /* Record if the current task acquiring the lock is a dl/rt task. */
> + if (!ret)
> + update_dlrt_reader(rwb);
>
> - return __rwbase_read_lock(rwb, state);
> + return ret;
> }
>
> static void __sched __rwbase_read_unlock(struct rwbase_rt *rwb,
> @@ -264,12 +333,17 @@ static int __sched rwbase_write_lock(struct rwbase_rt *rwb,
> if (__rwbase_write_trylock(rwb))
> break;
>
> + /* Record first new read/write contention. */
> + set_writer_blocked(rwb);
> +
> raw_spin_unlock_irqrestore(&rtm->wait_lock, flags);
> rwbase_schedule();
> raw_spin_lock_irqsave(&rtm->wait_lock, flags);
>
> set_current_state(state);
> }
> +
> + rwb->waiter_blocked = 0;
> rwbase_restore_current_state();
> trace_contention_end(rwb, 0);
>

Sebastian