Re: [RFCv3 PATCH 34/48] sched: Bias new task wakeups towards higher capacity cpus

From: Morten Rasmussen
Date: Wed Mar 25 2015 - 14:17:58 EST


On Tue, Mar 24, 2015 at 01:33:22PM +0000, Peter Zijlstra wrote:
> On Wed, Feb 04, 2015 at 06:31:11PM +0000, Morten Rasmussen wrote:
> > Make wake-ups of new tasks (find_idlest_group) aware of any differences
> > in cpu compute capacity so new tasks don't get handed off to cpus with
> > lower capacity.
>
> That says what; but fails to explain why we want to do this.
>
> Remember Changelogs should answer what+why and if complicated also
> reason about how.

Right, I will make sure to fix that.

cpu compute capacity may be different for two reasons:

1. Compute capacity of some cpus may be reduced due to RT and/or IRQ
   work (affects capacity_of(cpu)).

2. Compute capacity of some cpus is lower due to different cpu
   microarchitectures (big.LITTLE; affects both capacity_of(cpu) and
   capacity_orig_of(cpu)).
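
A rough sketch of the distinction (not the actual kernel code;
rt_irq_pressure() is a made-up stand-in for the scale_rt_capacity()
accounting in fair.c):

	/* Full capacity of the cpu as dictated by its microarchitecture. */
	unsigned long max_cap = capacity_orig_of(cpu);

	/*
	 * Capacity left over for CFS tasks once time spent on RT and
	 * IRQ work has been subtracted; rt_irq_pressure() is a
	 * hypothetical helper for illustration only.
	 */
	unsigned long cfs_cap = max_cap - rt_irq_pressure(cpu);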

find_idlest_group() is used for SD_BALANCE_{FORK,EXEC} but not
SD_BALANCE_WAKE, that is, only for new tasks. Since we can't predict
how they are going to behave, we assume that they are cpu intensive.
This is also how their load tracking is initialized (see the sketch
below). If we stick with that assumption, new tasks should go on the
cpus with the most capacity to ensure we don't slow them down while
they are building up their load tracking history. Putting new tasks on
high capacity cpus also improves system responsiveness, as it may take
a while before tasks put on a cpu with too little capacity get migrated
to one with higher capacity.
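
For context, this is roughly how a new task's load tracking starts out
looking cpu intensive (a simplified sketch modelled on
init_task_runnable_average(); details may differ):

	/*
	 * Give a new task a full runnable history so it looks cpu
	 * intensive from the start (simplified sketch).
	 */
	void init_new_task_load(struct task_struct *p)
	{
		u32 slice = sched_slice(task_cfs_rq(p), &p->se) >> 10;

		/* Pretend the task was runnable for one full slice. */
		p->se.avg.runnable_avg_sum = slice;
		p->se.avg.runnable_avg_period = slice;
	}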

One could also imagine that one day thermal management constraints in
terms of frequency capping would be visible through capacity in the
scheduler, which would be a third reason for having different compute
capacities.

>
> > @@ -4971,6 +4972,7 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p,
> >
> > /* Tally up the load of all CPUs in the group */
> > avg_load = 0;
> > + cpu_capacity = 0;
> >
> > for_each_cpu(i, sched_group_cpus(group)) {
> > /* Bias balancing toward cpus of our domain */
> > @@ -4980,6 +4982,9 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p,
> > load = target_load(i, load_idx);
> >
> > avg_load += load;
> > +
> > + if (cpu_capacity < capacity_of(i))
> > + cpu_capacity = capacity_of(i);
> > }
> >
> > /* Adjust by relative CPU capacity of the group */
>
> So basically you're constructing the max_cpu_capacity for that group
> here. Might it be clearer to explicitly name/write it as such?
>
> max_cpu_capacity = max(max_cpu_capacity, capacity_of(i));

Agreed.
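
For reference, the tally loop would then look something like this
(untested sketch):

	/* Tally up the load and max cpu capacity of all CPUs in the group */
	avg_load = 0;
	max_cpu_capacity = 0;

	for_each_cpu(i, sched_group_cpus(group)) {
		/* Bias balancing toward cpus of our domain */
		if (local_group)
			load = source_load(i, load_idx);
		else
			load = target_load(i, load_idx);

		avg_load += load;
		max_cpu_capacity = max(max_cpu_capacity, capacity_of(i));
	}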

>
> > @@ -4987,14 +4992,20 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p,
> >
> > if (local_group) {
> > this_load = avg_load;
> > + this_cpu_cap = cpu_capacity;
> > } else if (avg_load < min_load) {
> > min_load = avg_load;
> > idlest = group;
> > + idlest_cpu_cap = cpu_capacity;
> > }
> > } while (group = group->next, group != sd->groups);
> >
> > - if (!idlest || 100*this_load < imbalance*min_load)
> > + if (!idlest)
> > + return NULL;
> > +
> > + if (100*this_load < imbalance*min_load && this_cpu_cap >= idlest_cpu_cap)
> > return NULL;
>
> And here you then fail to provide an idlest group if the selected group
> has less (max) capacity than the current group.
>
> /me goes double check wth capacity_of() returns again, yes, this seems
> dubious. Suppose we have our two symmetric clusters and for some reason
> we've routed all our interrupts to one cluster and every cpu regularly
> receives interrupts. This means that the capacity_of() of this IRQ
> cluster is always less than the other.
>
> The above modification will result in tasks always being placed on the
> other cluster, even though it might be very busy indeed.
>
> If you want to do something like this; one should really add in a
> current usage metric or so.

Right, that if-condition is broken :(

We want to put new tasks on the cpu with the most spare capacity, and
if no cpu has spare capacity left, then decide based on the existing
(100*this_load < imbalance*min_load) condition. Maybe gather the
maximum available (spare) cpu capacity per group instead of
max_cpu_capacity, along the lines of the sketch below. I will have to
ponder that a bit and come up with something.
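
One (untested) way to sketch that, assuming a usage metric along the
lines of get_cpu_usage() from earlier in this series:

	/*
	 * Track the maximum spare (available) capacity in the group
	 * rather than the raw maximum capacity.
	 */
	spare_capacity = 0;

	for_each_cpu(i, sched_group_cpus(group)) {
		unsigned long usage = get_cpu_usage(i);
		unsigned long spare = 0;

		if (capacity_of(i) > usage)
			spare = capacity_of(i) - usage;

		spare_capacity = max(spare_capacity, spare);
	}

The group with the most spare capacity would then be preferred, falling
back to the 100*this_load < imbalance*min_load comparison when no group
has any spare capacity left.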