Re: [PATCH] sched/fair: fix load_balance redo for null imbalance

From: Vincent Guittot
Date: Fri Sep 07 2018 - 09:08:49 EST


On Fri, 7 Sep 2018 at 14:56, Peter Zijlstra <peterz@xxxxxxxxxxxxx> wrote:
>
> On Fri, Sep 07, 2018 at 02:35:51PM +0200, Vincent Guittot wrote:
> > Le Friday 07 Sep 2018 Ã 13:37:49 (+0200), Peter Zijlstra a Ãcrit :
> > > On Fri, Sep 07, 2018 at 09:51:04AM +0200, Vincent Guittot wrote:
> > > > It can happen that load_balance finds a busiest group and then a busiest rq
> > > > but the calculated imbalance is in fact null.
> > >
> > > Cute. Does that happen often?
> >
> > I have a use case with RT tasks that reproduces the problem regularly.
> > It happens at least when we have CPUs with different capacity either because
> > of heterogeous CPU or because of RT/DL reducing available capacity for cfs
> > I have put the call path that trigs the problem below and accroding to the
> > comment it seems that we can reach similar state when playing with priority.
> >
> > >
> > > > If the calculated imbalance is null, it's useless to try to find a busiest
> > > > rq as no task will be migrated and we can return immediately.
> > > >
> > > > This situation can happen with heterogeneous system or smp system when RT
> > > > tasks are decreasing the capacity of some CPUs.
> > >
> > > Is it the result of one of those "force_balance" conditions in
> > > find_busiest_group() ? Should we not fix that to then return NULL
> > > instead?
> >
> > The UC is:
> > We have a newly_idle load balance that is triggered when RT task becomes idle
> > ( but I think that I have seen that with idle load balance too)
> >
> > we trigs:
> > if (env->idle != CPU_NOT_IDLE && group_has_capacity(env, local) &&
> > busiest->group_no_capacity)
> > goto force_balance;
> >
> > In calculate_imbalance we use the path
> > /*
> > * Avg load of busiest sg can be less and avg load of local sg can
> > * be greater than avg load across all sgs of sd because avg load
> > * factors in sg capacity and sgs with smaller group_type are
> > * skipped when updating the busiest sg:
> > */
> > if (busiest->avg_load <= sds->avg_load ||
> > local->avg_load >= sds->avg_load) {
> > env->imbalance = 0;
> > return fix_small_imbalance(env, sds);
> > }
> >
> > but fix_small_imbalance finally decides to return without modifying imbalance
> > like here
> > if (busiest->avg_load + scaled_busy_load_per_task >=
> > local->avg_load + (scaled_busy_load_per_task * imbn)) {
> > env->imbalance = busiest->load_per_task;
> > return;
> > }
>
> That one actually does modify imbalance :-) But I get your point.
>
> > Beside this patch, I'm preparing another patch in fix small imbalance to
> > ensure 1 task per CPU in similar situation but according to the comment above,
> > we can reach this situation because of tasks priority
>
> Didn't we all hate fix_small_imbalance() ?

yes. the rational of some decisions in the function are more and more
opaque to me. For now I'm trying to fix all UCs that I can find to
then try to get a generic rule
As an example, I have some situations (other than those discussed
here) in fix_small_imbalance where the task migration decision mainly
depends of a variation of +/-5 in the load of a task. This finally
generates good fairness between tasks but the root cause of this
fairness is a bit random.

>
> Anyway, I think I'd prefer something like the below; although it might
> be nicer to thread the return value through calculate_imbalance() and
> fix_small_imbalance(), but looking at them that's not going to be
> particularly nicer.
>
> Do you agree with this?, If so, I'll stick your orignal Changelog on it
> and pretend this is what you send me :-)

That's fine to me :-)

>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index b39fb596f6c1..0596a29f3d2a 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -8269,7 +8269,7 @@ static struct sched_group *find_busiest_group(struct lb_env *env)
> force_balance:
> /* Looks like there is an imbalance. Compute it */
> calculate_imbalance(env, &sds);
> - return sds.busiest;
> + return env->imbalance ? sds.busiest : NULL;
>
> out_balanced:
> env->imbalance = 0;