Re: [RFCv3 PATCH 34/48] sched: Bias new task wakeups towards higher capacity cpus

From: Peter Zijlstra
Date: Tue Mar 24 2015 - 09:33:41 EST


On Wed, Feb 04, 2015 at 06:31:11PM +0000, Morten Rasmussen wrote:
> Make wake-ups of new tasks (find_idlest_group) aware of any differences
> in cpu compute capacity so new tasks don't get handed off to cpus with
> lower capacity.

That says what, but it fails to explain why we want to do this.

Remember: changelogs should answer what and why, and, if the change is
complicated, also reason about how.

> @@ -4971,6 +4972,7 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p,
>
> /* Tally up the load of all CPUs in the group */
> avg_load = 0;
> + cpu_capacity = 0;
>
> for_each_cpu(i, sched_group_cpus(group)) {
> /* Bias balancing toward cpus of our domain */
> @@ -4980,6 +4982,9 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p,
> load = target_load(i, load_idx);
>
> avg_load += load;
> +
> + if (cpu_capacity < capacity_of(i))
> + cpu_capacity = capacity_of(i);
> }
>
> /* Adjust by relative CPU capacity of the group */

So basically you're constructing the max_cpu_capacity for that group
here. Might it be clearer to explicitly name/write it as such?

max_cpu_capacity = max(max_cpu_capacity, capacity_of(i));
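For illustration, the tally loop could then read something like the
below (a sketch only; the source_load() branch is the usual one that
the quoted diff context elides):

	/* Tally up the load of all CPUs in the group */
	avg_load = 0;
	max_cpu_capacity = 0;

	for_each_cpu(i, sched_group_cpus(group)) {
		/* Bias balancing toward cpus of our domain */
		if (local_group)
			load = source_load(i, load_idx);
		else
			load = target_load(i, load_idx);

		avg_load += load;

		/* Track the biggest cpu capacity seen in this group */
		max_cpu_capacity = max(max_cpu_capacity, capacity_of(i));
	}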

> @@ -4987,14 +4992,20 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p,
>
> if (local_group) {
> this_load = avg_load;
> + this_cpu_cap = cpu_capacity;
> } else if (avg_load < min_load) {
> min_load = avg_load;
> idlest = group;
> + idlest_cpu_cap = cpu_capacity;
> }
> } while (group = group->next, group != sd->groups);
>
> - if (!idlest || 100*this_load < imbalance*min_load)
> + if (!idlest)
> + return NULL;
> +
> + if (100*this_load < imbalance*min_load && this_cpu_cap >= idlest_cpu_cap)
> return NULL;

And here you then fail to provide an idlest group if the selected group
has less (max) capacity than the current group.

/me goes double check what capacity_of() returns again; yes, this seems
dubious. capacity_of() reports the capacity left over after RT/IRQ time
is subtracted, so a cpu that spends, say, 20% of its time servicing
interrupts ends up at roughly 819 out of 1024. Now suppose we have our
two symmetric clusters and for some reason we've routed all our
interrupts to one cluster, so that every cpu there regularly receives
interrupts. The capacity_of() of the cpus in this IRQ cluster is then
always less than that of the other cluster.

The above modification will result in tasks always being placed on the
other cluster, even though it might be very busy indeed.

If you want to do something like this, you really should factor in a
current usage metric of some sort.
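
Something along these lines, say (a rough sketch only, not tested;
cpu_usage() stands in for whatever per-cpu usage metric would get
added, it is not in this patch):

	unsigned long max_spare_cap = 0;

	for_each_cpu(i, sched_group_cpus(group)) {
		unsigned long cap = capacity_of(i);
		unsigned long usage = cpu_usage(i);	/* hypothetical helper */

		/* Spare capacity = capacity left after current usage */
		if (cap > usage)
			max_spare_cap = max(max_spare_cap, cap - usage);
	}

Then compare the groups on max_spare_cap instead of the raw
capacity_of() maximum, so an otherwise-idle IRQ cluster can still win
over a busy cluster whose raw capacity happens to be higher.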