Re: [PATCH] sched/fair: Load balance aggressively for SCHED_IDLE CPUs

From: Vincent Guittot
Date: Wed Jan 08 2020 - 03:05:36 EST


On Tue, 7 Jan 2020 at 13:42, Peter Zijlstra <peterz@xxxxxxxxxxxxx> wrote:
>
> On Tue, Dec 24, 2019 at 10:43:30AM +0530, Viresh Kumar wrote:
> > The fair scheduler performs periodic load balance on every CPU to check
> > if it can pull some tasks from other busy CPUs. The duration of this
> > periodic load balance is set to sd->balance_interval for the idle CPUs
> > and is calculated by multiplying the sd->balance_interval with the
> > sd->busy_factor (set to 32 by default) for the busy CPUs. The
> > multiplication is done for busy CPUs to avoid doing load balance too
> > often and rather spend more time executing actual tasks. While that is
> > the right thing to do for the CPUs busy with SCHED_OTHER or SCHED_BATCH
> > tasks, it may not be the optimal thing for CPUs running only SCHED_IDLE
> > tasks.
> >
> > With the recent enhancements in the fair scheduler around SCHED_IDLE
> > CPUs, we now prefer to enqueue a newly-woken task to a SCHED_IDLE
> > CPU instead of other busy or idle CPUs. The same reasoning should be
> > applied to the load balancer as well to make it migrate tasks more
> > aggressively to a SCHED_IDLE CPU, as that will reduce the scheduling
> > latency of the migrated (SCHED_OTHER) tasks.
> >
> > This patch makes minimal changes to the fair scheduler to do the next
> > load balance soon after the last non-SCHED_IDLE task is dequeued from a
> > runqueue, i.e. making the CPU SCHED_IDLE. Also the sd->busy_factor is
> > ignored while calculating the balance_interval for such CPUs. This is
> > done to avoid delaying the periodic load balance by a few hundred
> > milliseconds for SCHED_IDLE CPUs.
> >
> > This was tested on an ARM64 Hikey620 platform (octa-core) with the help of
> > rt-app and it was verified, using kernel traces, that the newly
> > SCHED_IDLE CPU does load balancing shortly after it becomes SCHED_IDLE
> > and pulls tasks from other busy CPUs.
>
> Nothing seems really objectionable here; I have a few comments below.
>
> Vincent?

The change makes sense to me. This should fix the last remaining long
scheduling latency of SCHED_OTHER tasks in the presence of SCHED_IDLE
tasks.

With the change proposed by Peter below, you can add my
Reviewed-by: Vincent Guittot <vincent.guittot@xxxxxxxxxx>
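
For readers following the changelog, the interval arithmetic it describes
can be sketched in userspace C as below. This is a simplified model, not
the kernel code: the struct and function names are hypothetical, the real
helper also converts to jiffies and clamps the result, and the 8 ms base
interval in the usage note is only an illustrative assumption.

```c
/* Simplified userspace model; all names here are hypothetical. */
struct sd_sketch {
	unsigned long balance_interval_ms;	/* base interval, in ms */
	unsigned int busy_factor;		/* 32 by default per the changelog */
};

/* With the patch, a CPU counts as busy only if it is neither idle
 * nor running nothing but SCHED_IDLE tasks. */
static int sketch_cpu_busy(int cpu_is_idle, int cpu_is_sched_idle)
{
	return !cpu_is_idle && !cpu_is_sched_idle;
}

/* The busy factor stretches the interval only for busy CPUs. */
static unsigned long sketch_balance_interval_ms(const struct sd_sketch *sd,
						int busy)
{
	unsigned long interval = sd->balance_interval_ms;

	if (busy)
		interval *= sd->busy_factor;
	return interval;
}
```

Assuming a base interval of 8 ms and a busy factor of 32, a busy CPU
would rebalance every 256 ms, while an idle or SCHED_IDLE CPU would keep
the 8 ms base interval. That gap is the "few hundred milliseconds" the
changelog refers to.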

>
>
> > @@ -5324,6 +5336,7 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
> >  	struct sched_entity *se = &p->se;
> >  	int task_sleep = flags & DEQUEUE_SLEEP;
> >  	int idle_h_nr_running = task_has_idle_policy(p);
> > +	bool was_sched_idle = sched_idle_rq(rq);
> >
> >  	for_each_sched_entity(se) {
> >  		cfs_rq = cfs_rq_of(se);
> > @@ -5370,6 +5383,10 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
> >  	if (!se)
> >  		sub_nr_running(rq, 1);
> >
> > +	/* balance early to pull high priority tasks */
> > +	if (unlikely(!was_sched_idle && sched_idle_rq(rq)))
> > +		rq->next_balance = jiffies;
> > +
> >  	util_est_dequeue(&rq->cfs, p, task_sleep);
> >  	hrtick_update(rq);
> >  }
>
> This can effectively set ->next_balance in the past, but given we only
> tickle the balancer on every jiffy edge, that is of no concern. It just
> made me stumble when reading this.
>
> Not sure whether it even deserves a comment.
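
The point about ->next_balance landing in the past can be illustrated
with a small userspace model of the wrap-safe comparison. This is a
sketch under assumptions: the function names are hypothetical, and it
assumes the balancer is kicked once the current jiffies value is at or
past next_balance, in the style of the kernel's time_after_eq().

```c
/* Userspace model of a wrap-safe "a is at or after b" test on
 * unsigned tick counters; names are hypothetical. */
static int sketch_time_after_eq(unsigned long a, unsigned long b)
{
	/* The signed difference stays correct across counter wraparound. */
	return (long)(a - b) >= 0;
}

/* A balance is due when "now" has reached or passed next_balance. */
static int sketch_balance_due(unsigned long now, unsigned long next_balance)
{
	return sketch_time_after_eq(now, next_balance);
}
```

If next_balance is set to the current jiffies value and the check only
runs on a later tick, next_balance is already in the past, and the
comparison still reports the balance as due.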
>
> > @@ -9531,6 +9539,7 @@ static void rebalance_domains(struct rq *rq, enum cpu_idle_type idle)
> >  {
> >  	int continue_balancing = 1;
> >  	int cpu = rq->cpu;
> > +	int busy = idle != CPU_IDLE && !sched_idle_cpu(cpu);
> >  	unsigned long interval;
> >  	struct sched_domain *sd;
> >  	/* Earliest time when we have to do rebalance again */
> > @@ -9567,7 +9576,7 @@ static void rebalance_domains(struct rq *rq, enum cpu_idle_type idle)
> >  			break;
> >  		}
> >
> > -		interval = get_sd_balance_interval(sd, idle != CPU_IDLE);
> > +		interval = get_sd_balance_interval(sd, busy);
> >
> >  		need_serialize = sd->flags & SD_SERIALIZE;
> >  		if (need_serialize) {
> > @@ -9582,10 +9591,16 @@ static void rebalance_domains(struct rq *rq, enum cpu_idle_type idle)
> >  				 * env->dst_cpu, so we can't know our idle
> >  				 * state even if we migrated tasks. Update it.
> >  				 */
> > -				idle = idle_cpu(cpu) ? CPU_IDLE : CPU_NOT_IDLE;
> > +				if (idle_cpu(cpu)) {
> > +					idle = CPU_IDLE;
> > +					busy = 0;
> > +				} else {
> > +					idle = CPU_NOT_IDLE;
> > +					busy = !sched_idle_cpu(cpu);
> > +				}
>
> This is inconsistent vs the earlier code. That is, why not write it
> like:
>
> 	idle = idle_cpu(cpu) ? CPU_IDLE : CPU_NOT_IDLE;
> 	busy = idle != CPU_IDLE && !sched_idle_cpu(cpu);

This looks easier to read.
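
As a quick check that the two forms agree, here is a userspace sketch
that models idle_cpu() and sched_idle_cpu() as plain boolean inputs. The
enum, struct, and function names are hypothetical stand-ins, not kernel
definitions.

```c
/* Hypothetical stand-ins for the kernel's idle types. */
enum sketch_idle_type { SKETCH_CPU_IDLE, SKETCH_CPU_NOT_IDLE };

struct sketch_state {
	enum sketch_idle_type idle;
	int busy;
};

/* Form used in the patch as posted: explicit if/else. */
static struct sketch_state sketch_update_posted(int cpu_idle, int cpu_sched_idle)
{
	struct sketch_state s;

	if (cpu_idle) {
		s.idle = SKETCH_CPU_IDLE;
		s.busy = 0;
	} else {
		s.idle = SKETCH_CPU_NOT_IDLE;
		s.busy = !cpu_sched_idle;
	}
	return s;
}

/* Suggested form: same two expressions as at the top of the function. */
static struct sketch_state sketch_update_suggested(int cpu_idle, int cpu_sched_idle)
{
	struct sketch_state s;

	s.idle = cpu_idle ? SKETCH_CPU_IDLE : SKETCH_CPU_NOT_IDLE;
	s.busy = s.idle != SKETCH_CPU_IDLE && !cpu_sched_idle;
	return s;
}
```

Both functions give the same idle and busy values for all four input
combinations, so the shorter form only changes readability and keeps the
update consistent with the initial computation of busy.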

>
> >  			}
> >  			sd->last_balance = jiffies;
> > -			interval = get_sd_balance_interval(sd, idle != CPU_IDLE);
> > +			interval = get_sd_balance_interval(sd, busy);
> >  		}
> >  		if (need_serialize)
> >  			spin_unlock(&balancing);