Re: [PATCH] sched/fair: favor non-idle group in tick preemption

From: Josh Don
Date: Fri Oct 28 2022 - 18:41:06 EST


On Thu, Oct 27, 2022 at 8:57 PM Chuyi Zhou <zhouchuyi@xxxxxxxxxxxxx> wrote:
>
>
>
> 在 2022/10/28 07:34, Josh Don 写道:
> > Hi Chuyi,
> >
> > On Thu, Oct 27, 2022 at 1:16 AM Chuyi Zhou <zhouchuyi@xxxxxxxxxxxxx> wrote:
> >>
> >> The non-idle se dominates competition vs the idle se when they
> >> are belong to the same group. We ensure that idle groups would not
> >> preempt non-idle group in wakeup preemption(see check_preempt_wakeup()).
> >> However, this can happen in tick preemption, since check_preempt_tick()
> >> dose not check current/se is idle or not. This patch adds this check.
> >>
> >> Signed-off-by: Chuyi Zhou <zhouchuyi@xxxxxxxxxxxxx>
> >> ---
> >> kernel/sched/fair.c | 12 +++++++++++-
> >> 1 file changed, 11 insertions(+), 1 deletion(-)
> >>
> >> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> >> index e4a0b8bd941c..f3324b8753b3 100644
> >> --- a/kernel/sched/fair.c
> >> +++ b/kernel/sched/fair.c
> >> @@ -4750,6 +4750,7 @@ static void
> >> check_preempt_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr)
> >> {
> >> unsigned long ideal_runtime, delta_exec;
> >> + int cse_is_idle, pse_is_idle;
> >> struct sched_entity *se;
> >> s64 delta;
> >>
> >> @@ -4779,8 +4780,17 @@ check_preempt_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr)
> >> if (delta < 0)
> >> return;
> >>
> >> - if (delta > ideal_runtime)
> >> + if (delta > ideal_runtime) {
> >> + /*
> >> + * Favor non-idle group even in tick preemption
> >> + */
> >> + cse_is_idle = se_is_idle(curr);
> >> + pse_is_idle = se_is_idle(se);
> >> + if (unlikely(!cse_is_idle && pse_is_idle))
> >> + return;
> >
> Hi Josh, thanks for your reply,
> > This would make it so that we never have tick based preemption of a
> > non-idle entity by an idle entity. That's a recipe for starvation of
> > the idle entity, if the non-idle entity is cpu bound.
> >
> > Beyond that though, I'm not quite sure the issue being solved here.
> > The large difference in weight between the idle and non-idle entity
> > means that the non-idle entity will not be preempted for quite a while
> > due to its ideal_runtime being quite high. The idle entity will
> > quickly be preempted on the next tick it takes due to the smaller
> > value of sysctl_sched_idle_min_granularity.
> >
> Actually, I did some tests and traced this issue. the result shows that
> this can happen with small probability. I also do some benchmarks. The
> result seems it has no performance harm, and we can reduce 2%~3%
> context switch when idle group & non-idle group are present at the same
> time. In addition, for throughput concern, I think we better let
> non-idle entity consume it's ideal_runtime. However, as you said, it may
> cause starvation of the idle entity.....

I don't doubt it improves the performance of the non-idle entity, but
it is never advisable to indefinitely starve threads. That's a recipe
for hard lockups, priority inversion, etc.

Does your non-idle entity have a reasonable number of cpu.shares? You
can push out the round-robin interval by tuning CFS parameters without
completely starving the idle entity.

> There is another question I would like to take this opportunity to
> consult you. In our production environment, in some cases, we want to
> adjust the weight/shares of the idle-cgroup which means we need to
> change the logic of sched_group_set_shares(), and it can increase the
> probability of the above issue. So, for what reasons did you prohibit
> users from changing weights of idle cgroup.

The reason for limiting the control of weight for idle cgroups is to
match the semantics of the per-task SCHED_IDLE api, which gives
SCHED_IDLE threads minimum weight. The idea behind SCHED_IDLE is that
these entities are intended to soak "unused" cpu cycles, and should
give minimal interference to any non-idle thread. However, we don't
have strict priority between idle and non-idle, due to the potential
for starvation issues.

Perhaps you could clarify your use case a bit further. Why do you want
to change the weight? Is it to adjust the competition between two idle
groups, or something else?

>
> Thanks again for your review.
>
> Best,
> Chuyi
> > The wakeup check is useful for latency sensitive non-idle tasks.
> > However, in steady state competition between idle and non-idle, we
> > must allow some amount of round-robin.
> >
> >> +
> >> resched_curr(rq_of(cfs_rq));
> >> + }
> >> }
> >>
> >> static void
> >> --
> >> 2.20.1
> >>
> >
> > Best,
> > Josh