Re: [PATCH v3] locking/rtmutex: Limit # of lock stealing for non-RT waiters

From: Waiman Long
Date: Thu Jul 07 2022 - 15:25:47 EST


On 7/7/22 15:04, Boqun Feng wrote:
On Thu, Jul 07, 2022 at 02:45:10PM -0400, Waiman Long wrote:
On 7/7/22 14:22, Boqun Feng wrote:
On Wed, Jul 06, 2022 at 10:03:10AM -0400, Waiman Long wrote:
On 7/6/22 09:59, Waiman Long wrote:
Commit 48eb3f4fcfd3 ("locking/rtmutex: Implement equal priority lock
stealing") allows unlimited number of lock stealing's for non-RT
tasks. That can lead to lock starvation of non-RT top waiter tasks if
there is a constant incoming stream of non-RT lockers. This can cause
rcu_preempt self-detected stall or even task lockup in PREEMPT_RT kernel.
For example,

[77107.424943] rcu: INFO: rcu_preempt self-detected stall on CPU
[ 1249.921363] INFO: task systemd:2178 blocked for more than 622 seconds.

Avoiding this problem and ensuring forward progress by limiting the
number of times that a lock can be stolen from each waiter. This patch
sets a threshold of 32. That number is arbitrary and can be changed
if needed.

Fixes: 48eb3f4fcfd3 ("locking/rtmutex: Implement equal priority lock stealing")
Signed-off-by: Waiman Long <longman@xxxxxxxxxx>
---
kernel/locking/rtmutex.c | 9 ++++++---
kernel/locking/rtmutex_common.h | 8 ++++++++
2 files changed, 14 insertions(+), 3 deletions(-)

[v3: Increase threshold to 32 and add rcu_preempt self-detected stall]
Note that I decided to increase the threshold to 32 from 10 to reduce the
potential performance impact of this change, if any. We also found out that
this patch can fix some of the rcu_preempt self-detected stall problems that
we saw with the PREEMPT_RT kernel. So I added that information in the patch
description.

Have you considered (and tested) whether we can set the threshold
directly proportional to nr_cpu_ids? Because IIUC, the favorable case
for lock stealing is that every CPU gets a chance to steal once. If one
CPU can steal twice, 1) either there is a context switch between two
tasks, which costs similarly as waking up the waiter, or 2) a task drops
and re-graps a lock, which means the task wants to yield to other
waiters of the lock.
There is no inherent restriction on not allowing the same cpu stealing the
lock twice or more. With rtmutex, the top waiter may be sleeping and the
Well, I'm not saying we need to restrict the same cpu to steal a lock
twice or more. Think about this, when there is a task running on CPU 1
already steals a lock once, for example:

<lock release>
{task C is the top waiter}

CPU 1
=====
<now task A running>
lock(); // steal the lock
...
unlock():
// set owner to NULL
<switch task B> // similar cost to wake up A
lock(); // steal the lock

, which means if a CPU steals a lock twice or more, it's almost certain
that a context happened between two steals ("almost" because there could
be a case where task A lock()+unlock() twice, but as I said, it
means that task A is willing to yield.).

Therefore if there are @nr_cpu_ids lock steals, it means either there is
a context switch somewhere or a task has been willing to yield. And I
think it's a reasonable signal to stop lock stealing.

Thoughts?

The reality is that a task can acquire the same lock multiple times before a context switch. So I believe stealing a lock from the same sleeping top waiter multiple times can certainly happen. For a large SMP systems with hundred or even thousands of cpus, allowing that many lock stealing may significantly increase the lock acquisition latency for the unfortunate tasks.

Another alternative that I have done in the past is to put in a time stamp where a task become the top waiter and refrained from stealing the lock when the elapsed time from the time stamp exceeds a certain limit. That will limit the max lock acquisition latency of a non-RT task as long as there are no other RT tasks competing with it for the lock.

Cheers,
Longman