Re: [PATCH] sched/fair: Don't increase sd->balance_interval on newidle balance

From: Valentin Schneider
Date: Wed Sep 26 2018 - 05:35:13 EST


Hi,

On 26/09/18 09:13, Vincent Guittot wrote:
> On Tue, 25 Sep 2018 at 19:38, Valentin Schneider
> <valentin.schneider@xxxxxxx> wrote:
>>
>> When load_balance() fails to move some load because of task affinity,
>> we end up increasing sd->balance_interval to delay the next periodic
>> balance in the hopes that next time we look, that annoying pinned
>> task(s) will be gone.
>
> It's not about hope but about minimizing useless periodic load
> balancing, as there is no possibility to balance anything.
>
>>
>> However, idle_balance() pays no attention to sd->balance_interval, yet
>> it will still lead to an increase in balance_interval in case of
>> pinned tasks.
>
> I would say that this makes sense, as there is no way to pull the
> pinned task onto this rq, so running periodic load balance (busy or
> idle) is a waste of time and resources.
>
>

Increasing the delay for the next periodic balance when there are pinned tasks
makes sense. The issue I have here is that balance_interval can potentially
be doubled at every failed idle_balance(), without waiting for a sufficient
amount of time in between.

When going through periodic balance, load_balance() is gated by:

time_after_eq(jiffies, sd->last_balance + interval)

So if we previously ran load_balance() and failed to pull anything because
of task affinity, we'll wait for at least the new interval before attempting
another load_balance().

On the other hand, idle_balance() will call load_balance() regardless of
balance_interval values. So if the task composition allows it, we can repeatedly
fail to load_balance() and thus double balance_interval at every idle_balance()
"tick". You can thus go from 8ms to 512ms over a few successive idle_balance()s,
whereas periodic balance would have to wait out at least the sum of all
intermediate balance_intervals (8 + 16 + 32 + ... + 256, i.e. 504ms) before
reaching 512ms.
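
To put numbers on that, here's a minimal userspace sketch - not kernel code,
it just mimics the doubling, and assumes the 8ms initial / 512ms max
intervals from the example above:

  #include <stdio.h>

  int main(void)
  {
          unsigned int interval = 8;      /* ms; assumed initial balance_interval */
          unsigned int doublings = 0;     /* failed load_balance() attempts */
          unsigned int periodic_wait = 0; /* ms the periodic path must wait out */

          while (interval < 512) {
                  /*
                   * Periodic balance has to wait out the current interval
                   * before it can fail again and double once more.
                   */
                  periodic_wait += interval;
                  interval *= 2;
                  doublings++;
          }

          /* The newidle path can perform these doublings back-to-back. */
          printf("doublings=%u periodic_wait=%ums\n", doublings, periodic_wait);
          return 0;
  }

Six back-to-back newidle failures are enough to hit 512ms, whereas the
periodic path would have spent 504ms just waiting out the intermediate
intervals.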

>>
>> If we're going through several newidle balances (e.g. we have a
>> periodic task), this can lead to a huge increase of the
>
> This only happens when overload is set, which means that there are tasks
> on another CPU but this one can't pull one of them.
>
> Can you give us details about the use case that you care about ?
>

It's the same as I presented last week - devlib (some python target communication
library I use) has a phase where it spawns a lot of tasks at once to do
some setup (busybox, shutils, bash...). Some of those tasks are pinned to a
particular CPU, which can lead to failed load_balance() - and to make things
worse, there's a lot of idle_balance() going on in there.

Eventually, when I start running my actual workload a few hundred ms later, it's
impacted by that balance_interval increase.

Admittedly that's a specific use-case, but I don't think this quick increase
is something that was intended.

> Also, if the interval reaches a value that becomes a problem for you,
> why don't you reduce the max_interval with sysfs? This max interval
> can be reached on several other occasions.
>

True, although I saw that as an unwanted behaviour rather than a tuning problem.

>> balance_interval in a very small amount of time.
>>
>> To prevent that, don't increase the balance interval when going
>> through a newidle balance.
>>
>> Signed-off-by: Valentin Schneider <valentin.schneider@xxxxxxx>
>> ---
>> kernel/sched/fair.c | 19 ++++++++++++++-----
>> 1 file changed, 14 insertions(+), 5 deletions(-)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 6bd142d..620910d 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -8782,13 +8782,22 @@ static int load_balance(int this_cpu, struct rq *this_rq,
>> sd->nr_balance_failed = 0;
>>
>> out_one_pinned:
>> + ld_moved = 0;
>> +
>> + /*
>> + * idle_balance() disregards balance intervals, so we could repeatedly
>> + * reach this code, which would lead to balance_interval skyrocketing
>> + * in a short amount of time. Skip the balance_interval increase logic
>> + * to avoid that.
>> + */
>> + if (env.idle == CPU_NEWLY_IDLE)
>> + goto out;
>> +
>> /* tune up the balancing interval */
>> - if (((env.flags & LBF_ALL_PINNED) &&
>> - sd->balance_interval < MAX_PINNED_INTERVAL) ||
>> - (sd->balance_interval < sd->max_interval))
>> + if ((env.flags & LBF_ALL_PINNED &&
>> + sd->balance_interval < MAX_PINNED_INTERVAL) ||
>> + (sd->balance_interval < sd->max_interval))
>
> This looks like a code cleanup that is not related to the subject
>

Absolutely, I figured I could get away with it, but if that's preferred I can
split that out more clearly.

>> sd->balance_interval *= 2;
>> -
>> - ld_moved = 0;
>> out:
>> return ld_moved;
>> }
>> --
>> 2.7.4
>>