Re: [PATCH v3 11/13] sched/fair: Consider spare capacity in find_idlest_group()

From: Morten Rasmussen
Date: Thu Aug 18 2016 - 07:18:05 EST


On Tue, Aug 16, 2016 at 03:57:06PM +0200, Vincent Guittot wrote:
> On 25 July 2016 at 15:34, Morten Rasmussen <morten.rasmussen@xxxxxxx> wrote:
> > In low-utilization scenarios comparing relative loads in
> > find_idlest_group() doesn't always lead to the most optimum choice.
> > Systems with groups containing different numbers of cpus and/or cpus of
> > different compute capacity are significantly better off when considering
> > spare capacity rather than relative load in those scenarios.
> >
> > In addition to existing load based search an alternative spare capacity
> > based candidate sched_group is found and selected instead if sufficient
> > spare capacity exists. If not, existing behaviour is preserved.
> >
> > cc: Ingo Molnar <mingo@xxxxxxxxxx>
> > cc: Peter Zijlstra <peterz@xxxxxxxxxxxxx>
> >
> > Signed-off-by: Morten Rasmussen <morten.rasmussen@xxxxxxx>
> > ---
> > kernel/sched/fair.c | 46 +++++++++++++++++++++++++++++++++++++++++-----
> > 1 file changed, 41 insertions(+), 5 deletions(-)
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 9c6ec3bf75ce..e3654409d099 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -5164,6 +5164,14 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p,
> > return 1;
> > }
> >
> > +static inline int task_util(struct task_struct *p);
> > +static int cpu_util_wake(int cpu, struct task_struct *p);
> > +
> > +static unsigned long capacity_spare_wake(int cpu, struct task_struct *p)
> > +{
> > + return capacity_orig_of(cpu) - cpu_util_wake(cpu, p);
> > +}
> > +
> > /*
> > * find_idlest_group finds and returns the least busy CPU group within the
> > * domain.
> > @@ -5173,7 +5181,9 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p,
> > int this_cpu, int sd_flag)
> > {
> > struct sched_group *idlest = NULL, *group = sd->groups;
> > + struct sched_group *most_spare_sg = NULL;
> > unsigned long min_load = ULONG_MAX, this_load = 0;
> > + unsigned long most_spare = 0, this_spare = 0;
> > int load_idx = sd->forkexec_idx;
> > int imbalance = 100 + (sd->imbalance_pct-100)/2;
> >
> > @@ -5181,7 +5191,7 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p,
> > load_idx = sd->wake_idx;
> >
> > do {
> > - unsigned long load, avg_load;
> > + unsigned long load, avg_load, spare_cap, max_spare_cap;
> > int local_group;
> > int i;
> >
> > @@ -5193,8 +5203,12 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p,
> > local_group = cpumask_test_cpu(this_cpu,
> > sched_group_cpus(group));
> >
> > - /* Tally up the load of all CPUs in the group */
> > + /*
> > + * Tally up the load of all CPUs in the group and find
> > + * the group containing the cpu with most spare capacity.
> > + */
> > avg_load = 0;
> > + max_spare_cap = 0;
> >
> > for_each_cpu(i, sched_group_cpus(group)) {
> > /* Bias balancing toward cpus of our domain */
> > @@ -5204,6 +5218,13 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p,
> > load = target_load(i, load_idx);
> >
> > avg_load += load;
> > +
> > + spare_cap = capacity_spare_wake(i, p);
> > +
> > + if (spare_cap > max_spare_cap &&
> > + spare_cap > capacity_of(i) >> 3) {
>
> This condition probably needs some descriptions. You're not only
> looking for max spare capacity but also a significant spare capacity
> (more than 12.5% of cpu_capacity_orig). Can't this additional test
> lead to some strange situation where a CPU with more spare capacity
> will not be selected because of this 12.5% condition whereas another
> with less spare capacity will be selected because its capacity_orig is
> lower ?

Right, the reason why I added the 12.5% check is that I thought we
wouldn't want to pack cpus too aggressively. You are right that we could
reject a 1024 capacity with a spare capacity of 100 and pick a 512
capacity cpu with a spare capacity of 65.

Using an absolute margin instead, e.g. 128 instead of 12.5%, would fix
this particular issue but create another one as we would reject low
capacity cpus at very low utilization. For example, if we have cpus with
a capacity of 256. I thought that the relative threshold was more
generic and would cause less trouble, but I'm happy to discuss
alternatives.

>From a latency perspective it might not be a bad idea staying away from
cpus with a utilization even if they have more capacity available as the
task is more likely to end up waiting on the rq. For throughput tasks
you would of course want it the other way around.

>
> > + max_spare_cap = spare_cap;
> > + }
> > }
> >
> > /* Adjust by relative CPU capacity of the group */
> > @@ -5211,12 +5232,27 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p,
> >
> > if (local_group) {
> > this_load = avg_load;
> > - } else if (avg_load < min_load) {
> > - min_load = avg_load;
> > - idlest = group;
> > + this_spare = max_spare_cap;
> > + } else {
> > + if (avg_load < min_load) {
> > + min_load = avg_load;
> > + idlest = group;
> > + }
> > +
> > + if (most_spare < max_spare_cap) {
> > + most_spare = max_spare_cap;
> > + most_spare_sg = group;
> > + }
> > }
> > } while (group = group->next, group != sd->groups);
> >
> > + /* Found a significant amount of spare capacity. */
>
> It may worth explaining the threshold when it becomes better to choose
> the most spare group instead of the least loaded group.

Yes. I admit that the threshold is somewhat randomly chosen. Based on a
few experiments I found that requiring enough spare capacity to fit the
task completely was too conservative. We would bail out and go with the
least loaded groups very often, especially for new tasks, despite the
spare capacity only being slightly too small. Allowing a small degree of
stuffing of the task seemed better. Choosing the least loaded group
instead doesn't give any better throughput for the waking task unless it
has high priority. For overall throughput, the most spare capacity cpus
should be the better choice.

Should I just add a comment saying that we want to allow a little bit of
task stuffing to accommodate better for new tasks and have better overall
throughput, or should we investigate the threshold further?


>
> > + if (this_spare > task_util(p) / 2 &&
> > + imbalance*this_spare > 100*most_spare)
> > + return NULL;
> > + else if (most_spare > task_util(p) / 2)
> > + return most_spare_sg;
> > +
> > if (!idlest || 100*this_load < imbalance*min_load)
> > return NULL;
> > return idlest;
> > --
> > 1.9.1
> >