Re: [RFC PATCH v1] sched/fair: limit load balance redo times at the same sched_domain level

From: Vincent Guittot
Date: Mon Jan 25 2021 - 06:01:19 EST


On Mon, 25 Jan 2021 at 06:50, Aubrey Li <aubrey.li@xxxxxxxxx> wrote:
>
> A long-tail load balance cost is observed on the newly idle path,
> this is caused by a race window between the first nr_running check
> of the busiest runqueue and its nr_running recheck in detach_tasks.
>
> Before the busiest runqueue is locked, the tasks on the busiest
> runqueue could be pulled by other CPUs and nr_running of the busiest
> runqueu becomes 1, this causes detach_tasks breaks with LBF_ALL_PINNED

We should better detect that when trying to detach task like below

--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7688,6 +7688,16 @@ static int detach_tasks(struct lb_env *env)

lockdep_assert_held(&env->src_rq->lock);

+ /*
+ * Another CPU has emptied this runqueue in the meantime.
+ * Just return and leave the load_balance properly.
+ */
+ if (env->src_rq->nr_running <= 1 && !env->loop) {
+ /* Clear the flag as we will not test any task */
+ env->flags &= ~LBF_ALL_PINNED;
+ return 0;
+ }
+
if (env->imbalance <= 0)
return 0;


> flag set, and triggers load_balance redo at the same sched_domain level.
>
> In order to find the new busiest sched_group and CPU, load balance will
> recompute and update the various load statistics, which eventually leads
> to the long-tail load balance cost.
>
> This patch introduces a variable(sched_nr_lb_redo) to limit load balance
> redo times, combined with sysctl_sched_nr_migrate, the max load balance
> cost is reduced from 100+ us to 70+ us, measured on a 4s x86 system with
> 192 logical CPUs.
>
> Cc: Andi Kleen <ak@xxxxxxxxxxxxxxx>
> Cc: Tim Chen <tim.c.chen@xxxxxxxxxxxxxxx>
> Cc: Srinivas Pandruvada <srinivas.pandruvada@xxxxxxxxxxxxxxx>
> Cc: Rafael J. Wysocki <rafael.j.wysocki@xxxxxxxxx>
> Signed-off-by: Aubrey Li <aubrey.li@xxxxxxxxxxxxxxx>
> ---
> kernel/sched/fair.c | 7 ++++++-
> 1 file changed, 6 insertions(+), 1 deletion(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index ae7ceba..b59f371 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -7407,6 +7407,8 @@ struct lb_env {
> unsigned int loop;
> unsigned int loop_break;
> unsigned int loop_max;
> + unsigned int redo_cnt;
> + unsigned int redo_max;
>
> enum fbq_type fbq_type;
> enum migration_type migration_type;
> @@ -9525,6 +9527,7 @@ static int should_we_balance(struct lb_env *env)
> return group_balance_cpu(sg) == env->dst_cpu;
> }
>
> +static const unsigned int sched_nr_lb_redo = 1;
> /*
> * Check this_cpu to ensure it is balanced within domain. Attempt to move
> * tasks if there is an imbalance.
> @@ -9547,6 +9550,7 @@ static int load_balance(int this_cpu, struct rq *this_rq,
> .dst_grpmask = sched_group_span(sd->groups),
> .idle = idle,
> .loop_break = sched_nr_migrate_break,
> + .redo_max = sched_nr_lb_redo,
> .cpus = cpus,
> .fbq_type = all,
> .tasks = LIST_HEAD_INIT(env.tasks),
> @@ -9682,7 +9686,8 @@ static int load_balance(int this_cpu, struct rq *this_rq,
> * destination group that is receiving any migrated
> * load.
> */
> - if (!cpumask_subset(cpus, env.dst_grpmask)) {
> + if (!cpumask_subset(cpus, env.dst_grpmask) &&
> + ++env.redo_cnt < env.redo_max) {
> env.loop = 0;
> env.loop_break = sched_nr_migrate_break;
> goto redo;
> --
> 2.7.4
>