Re: [PATCH v2 1/3] sched/fair: fix rounding issue for asym packing

From: Valentin Schneider
Date: Tue Dec 18 2018 - 12:32:11 EST


On 14/12/2018 16:01, Vincent Guittot wrote:
> When check_asym_packing() is triggered, the imbalance is set to :
> busiest_stat.avg_load * busiest_stat.group_capacity / SCHED_CAPACITY_SCALE
> busiest_stat.avg_load also comes from a division and the final rounding
> can make imbalance slightly lower than the weighted load of the cfs_rq.
> But this is enough to skip the rq in find_busiest_queue and prevents asym
> migration from happening.
>
> Add 1 to the avg_load to make sure that the targeted cpu will not be
> skipped unexpectedly.
>
> Signed-off-by: Vincent Guittot <vincent.guittot@xxxxxxxxxx>
> ---
> kernel/sched/fair.c | 6 ++++++
> 1 file changed, 6 insertions(+)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index ca46964..c215f7a 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -8215,6 +8215,12 @@ static inline void update_sg_lb_stats(struct lb_env *env,
> /* Adjust by relative CPU capacity of the group */
> sgs->group_capacity = group->sgc->capacity;
> sgs->avg_load = (sgs->group_load*SCHED_CAPACITY_SCALE) / sgs->group_capacity;
> + /*
> + * Prevent division rounding to make the computation of imbalance
> + * slightly less than original value and to prevent the rq to be then
> + * selected as busiest queue
> + */
> + sgs->avg_load += 1;

I've tried investigating this off-by-one issue by running lmbench, but it
seems to be a gnarly one.



The adventure starts in update_sg_lb_stats() where we compute
sgs->avg_load:

sgs->avg_load = (sgs->group_load*SCHED_CAPACITY_SCALE) / sgs->group_capacity

Then we drop by find_busiest_group() and call check_asym_packing() which,
if we do need to pack, does:

env->imbalance = DIV_ROUND_CLOSEST(
		sds->busiest_stat.avg_load * sds->busiest_stat.group_capacity,
		SCHED_CAPACITY_SCALE);

And finally the one check that seems to be failing, and that you're
addressing with a +1, is in find_busiest_queue():

if (rq->nr_running == 1 && wl > env->imbalance &&
    !check_cpu_capacity(rq, env->sd))
	continue;
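
Chaining those three steps with made-up numbers shows how the busiest rq can
end up one unit out of reach. The group_load/group_capacity values below are
assumptions I picked to roughly mirror what I'm seeing, not values taken from
the trace further down:

#include <stdio.h>

#define SCHED_CAPACITY_SCALE	1024UL
/* What the kernel's DIV_ROUND_CLOSEST boils down to for unsigned operands */
#define DIV_ROUND_CLOSEST(x, d)	(((x) + (d) / 2) / (d))

int main(void)
{
	/* Assumed: one ~full task on an rq, slightly pressured LITTLE group */
	unsigned long group_load = 1024, group_capacity = 1843, wl = 1023;

	/* update_sg_lb_stats(): 1048576 / 1843 = 568 (568.9 truncated) */
	unsigned long avg_load = group_load * SCHED_CAPACITY_SCALE / group_capacity;

	/* check_asym_packing(): (568 * 1843 + 512) / 1024 = 1022 */
	unsigned long imbalance = DIV_ROUND_CLOSEST(avg_load * group_capacity,
						    SCHED_CAPACITY_SCALE);

	/* find_busiest_queue(): wl=1023 > imbalance=1022 -> rq gets skipped */
	printf("avg_load=%lu imbalance=%lu skipped=%d\n",
	       avg_load, imbalance, wl > imbalance);

	return 0;
}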



Now, running lmbench on my HiKey960 hacked up to use asym packing, I
get an example where there's a task that should be migrated via asym
packing but isn't:

# This is a DIE level balance
update_sg_lb_stats():
cpu=0 load=1
cpu=1 load=1023
cpu=2 load=0
cpu=3 load=0

avg_load=568

# Busiest group is [0-3]
find_busiest_group(): check_asym_packing():
env->imbalance = 1022

find_busiest_queue():
cpu=0 wl=1
cpu=1 wl=1023
cpu=2 wl=0
cpu=3 wl=0

Only CPU1 has rq->nr_running > 0 (==1 actually), but wl > env->imbalance
so we skip it in find_busiest_queue().

Now, I thought it was due to IRQ/rt pressure - the group capacity I'm seeing
(1840) isn't the maximum for the LITTLE cluster, which should be
(463 * 4) == 1852. I got curious and threw random group capacity values into
that equation while keeping the same avg_load [1], and it gets wild: you get
a signal that oscillates between 1024 and 1014!

[1]: https://gist.github.com/valschneider/f42252e83be943d276b901f57734b8ba

Adding +1 to avg_load doesn't solve those pesky rounding errors that get
worse as group_capacity grows, but it reverses the trend: the signal
oscillates above 1024.
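
For reference, here's a rough userspace reconstruction of that experiment -
this is my own reading of the gist (fixed group_load, avg_load recomputed for
each capacity value), so treat the exact numbers as illustrative rather than
a faithful copy:

#include <stdio.h>

#define SCHED_CAPACITY_SCALE	1024UL
#define DIV_ROUND_CLOSEST(x, d)	(((x) + (d) / 2) / (d))

int main(void)
{
	unsigned long group_load = 1024;	/* assumed constant */
	unsigned long cap;

	/* Sweep group_capacity; the truncation error grows with capacity */
	for (cap = 1024; cap <= 16384; cap += 127) {
		unsigned long avg_load = group_load * SCHED_CAPACITY_SCALE / cap;
		unsigned long imb = DIV_ROUND_CLOSEST(avg_load * cap,
						      SCHED_CAPACITY_SCALE);
		/* Same computation with the proposed +1 on avg_load */
		unsigned long imb_p1 = DIV_ROUND_CLOSEST((avg_load + 1) * cap,
							 SCHED_CAPACITY_SCALE);

		printf("cap=%5lu imbalance=%4lu imbalance(+1)=%4lu\n",
		       cap, imb, imb_p1);
	}

	return 0;
}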



So in the end a +1 *could* "fix" this, but
a) It'd make more sense to have it in the check_asym_packing() code (see the
   sketch below)
b) It's ugly and it makes me feel like this piece of code is flimsy AF.
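
A rough and completely untested sketch of what I mean by a) - biasing the
imbalance in check_asym_packing() itself rather than every group's avg_load -
could look something like this (just to show where the tweak would live, not
a proposal):

env->imbalance = DIV_ROUND_CLOSEST(
		sds->busiest_stat.avg_load * sds->busiest_stat.group_capacity,
		SCHED_CAPACITY_SCALE) + 1;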

>
> if (sgs->sum_nr_running)
> sgs->load_per_task = sgs->sum_weighted_load / sgs->sum_nr_running;
>