Re: [PATCH] sched/pi: Reweight fair_policy() tasks when inheriting prio

From: Vincent Guittot
Date: Mon Apr 08 2024 - 03:15:07 EST


On Fri, 5 Apr 2024 at 19:16, Qais Yousef <qyousef@xxxxxxxxxxx> wrote:
>
> On 04/05/24 14:15, Vincent Guittot wrote:
> > On Fri, 5 Apr 2024 at 00:05, Qais Yousef <qyousef@xxxxxxxxxxx> wrote:
> > >
> > > On 04/03/24 15:11, Vincent Guittot wrote:
> > > > On Wed, 3 Apr 2024 at 02:59, Qais Yousef <qyousef@xxxxxxxxxxx> wrote:
> > > > >
> > > > > For fair tasks inheriting the priority (nice) without reweighting is
> > > > > a NOP as the task's share won't change.
> > > >
> > > > AFAICT, there is no nice priority inheritance with rt_mutex; All nice
> > >
> > > Hmm from what I see there is
> > >
> > > > tasks are sorted with the same "default prio" in the rb waiter tree.
> > > > This means that the rt top waiter is not the cfs with highest prio but
> > > > the 1st cfs waiting for the mutex.
> > >
> > > This is about the order on which tasks contending for the lock more than the
> > > effective priority the task holding the lock should run at though, no?
> >
> > No, they are ordered by priority in the rb tree so you can get the
> > priority of the top waiter and apply it to the owner of the lock
>
> I think I see what you're getting at now. There's no guarantee the top waiter
> is the higher priority fair task. Yes.
>
> >
> > >
> > > >
> > > > >
> > > > > This is visible when running with PTHREAD_PRIO_INHERIT where fair tasks
> > > > > with low priority values are susceptible to starvation leading to PI
> > > > > like impact on lock contention.
> > > > >
> > > > > The logic in rt_mutex will reset these low priority fair tasks into nice
> > > > > 0, but without the additional reweight operation to actually update the
> > > > > weights, it doesn't have the desired impact of boosting them to allow
> > > > > them to run sooner/longer to release the lock.
> > > > >
> > > > > Apply the reweight for fair_policy() tasks to achieve the desired boost
> > > > > for those low nice values tasks. Note that boost here means resetting
> > > > > their nice to 0; as this is what the current logic does for fair tasks.
> > > >
> > > > But you can at the opposite decrease the cfs prio of a task
> > > > and even worse with the comment :
> > > > /* XXX used to be waiter->prio, not waiter->task->prio */
> > > >
> > > > we use the prio of the top cfs waiter (ie the one waiting for the
> > > > lock) not the default 0 so it can be anything in the range [-20:19]
> > > >
> > > > Then, a task with low prio (i.e. nice > 0) can get a prio boost even
> > > > if this task and the waiter are low priority tasks
> > >
> > > I don't see this effect. The only change I am doing here
> > > is that when we set the prio that we are supposed to be inheriting, instead of
> > > simply changing prio, I also ensure we reweight so that we run at the inherited
> > > nice value. I am not changing how the waiter logic works.
> >
> > But if you look more deeply in the code, you will see that all cfs are
> > sorted with the same default prio so cfs tasks are not sorted and are
> > considered to be the same.
>
> Yes I saw that. We can potentially revert 715f7f9ece46 ("locking/rtmutex:
> Squash !RT tasks to DEFAULT_PRIO") ;-)
>
> /hides
>
> >
> > All that to say that I think the weight is not applied on purpose.
> > This might work for your particular case but there are more changes to
> > be done if you want to apply prio inheritance between cfs tasks.
> >
> > As an example, what about the impact of cgroup on the actual weight
> > and the inherited priority of a task ? If the owner and the waiter
> > don't belong to the same cgroup their own prio is meaningless... task
> > nice -20 in a group with a weight equal to nice 19 vs a task nice 19
> > in a group with a weight equals to nice -20
>
> That is on my mind actually. But I thought it's a separate problem. That has to
> do with how we calculate the effective priority of the pi_task. And probably
> the sorting order to if we agree we need to revert the above. If that is done
> appropriately, I hope the current reweight approach could be used as-is. Hmm
> but but as I write this I realize the compound weight will still be different.
> Maybe we should inherit the weight rather than the prio, which IIUC should
> already be the effective weight taking cgroup into account?
>
> Just to put it out on the clear, you don't think the issue is wrong, but just
> that we need to look further for a proper fix, right? ie: it is a problem we
> should fix, but we need to nail down more details IIUC.

Yes, I agree about your problem but your current proposal is not
correct because there are more things to consider and fix

>
> If that's the case it'd be good to know what else beside sorting order and
> handling cgroup I need to take into account to make this more correct.
>
> There's the obvious SCHED_IDLE case of course that needs a policy promotion,
> beside weight adjustment.
>
> >
> >
> > >
> > > Here's my test app FWIW
> > >
> > > https://github.com/qais-yousef/pi_test
> > >
> > > When I run
> > >
> > > pi_test --lp-nice 0 --lp-nice 10
> > >
> > > the lp thread runs at 0 still
> > >
> > > If I do
> > >
> > > pi_test --lp-nice 10 --lp-nice 5
> > >
> > > low priority thread runs at 5
> > >
> > > What combination are you worried about? I can give it a try. I use
> > > sched-analyzer-pp [1] to see the division of runnable/running or you can
> > > monitor them on top
> > >
> > > #!/bin/bash
> > > set -eux
> > >
> > > sudo sched-analyzer &
> > >
> > > ./pi_test --lp-nice ${1:-10} --hp-nice ${2:-0} --affine-cpu ${3:-0} &
> > >
> > > sleep 10
> > >
> > > pkill -SIGKILL pi_test
> > >
> > > sudo pkill -SIGINT sched-analyzer
> > >
> > > sched-analyzer-pp --sched-states pi_test sched-analyzer.perfetto-trace
> > >
> > > Picutres of output is attached for before and after
> > >
> > > pi_test --lp-nice 10 --hp-nice 0
> > >
> > > [1] https://github.com/qais-yousef/sched-analyzer