Re: [RFC][PATCH 0/9] sched: Power scheduler design proposal

From: Arjan van de Ven
Date: Sat Jul 13 2013 - 12:14:42 EST


On 7/12/2013 11:49 PM, Peter Zijlstra wrote:
> On Tue, Jul 09, 2013 at 04:55:29PM +0100, Morten Rasmussen wrote:
> > Hi,
> >
> > This patch set is an initial prototype aiming at the overall power-aware
> > scheduler design proposal that I previously described
> > <http://permalink.gmane.org/gmane.linux.kernel/1508480>.
> >
> > The patch set introduces a cpu capacity managing 'power scheduler' which lives
> > by the side of the existing (process) scheduler. Its role is to monitor the
> > system load and decide which cpus should be available to the process
> > scheduler.

> Hmm...
>
> This looks like a userspace hotplug daemon approach lifted to kernel space :/
>
> How about instead of layering over the load-balancer to constrain its behaviour
> you change the behaviour to not need constraint? Fix it so it does the right
> thing, instead of limiting it.
>
> I don't think it's _that_ hard to make the balancer do packing over spreading.
> The power balance code removed in 8e7fbcbc had things like that (although it
> was broken). And I'm sure I've seen patches over the years that did similar
> things. Didn't Vincent and Alex also do things like that?

a basic "sort left" (e.g. when needing to pick a cpu for a task that is short running,
pick the lowest numbered idle one) will already have the effect of packing in practice.
it's not perfect packing, but on a statistical level it'll be quite good.

(this all assumes relatively idle systems with spare capacity to play with of course..
... but that's the domain where packing plays a role)
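To make the "sort left" idea concrete, here is a minimal, purely illustrative
userspace sketch (the idle-mask representation and the function name are invented
for this example, not existing kernel interfaces): scan the cpus in index order
and take the first idle one, which statistically concentrates short-running work
on the low-numbered cpus.

#include <stdio.h>
#include <stdint.h>

#define NR_CPUS 8

/* Purely illustrative idle mask: bit n set means cpu n is idle. */
static uint32_t idle_mask = 0xf4;	/* cpus 2, 4, 5, 6, 7 idle */

/*
 * "Sort left": pick the lowest-numbered idle cpu for a short-running
 * task.  Returns -1 if nothing is idle, in which case the caller would
 * fall back to the normal balancing path.
 */
static int pick_cpu_sort_left(uint32_t mask)
{
	int cpu;

	for (cpu = 0; cpu < NR_CPUS; cpu++)
		if (mask & (1u << cpu))
			return cpu;
	return -1;
}

int main(void)
{
	printf("short-running task placed on cpu %d\n",
	       pick_cpu_sort_left(idle_mask));
	return 0;
}

The scan order alone is what produces the packing; no explicit load comparison
is needed for the short-running case.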



> Arjan; from reading your emails you're mostly busy explaining what cannot be
> done. Please explain what _can_ be done and what Intel wants. From what I can
> see you basically promote a max P state max concurrency race to idle FTW.


Btw, one more thing I'd like to get is communication between the scheduler
and the policy/hardware drivers about task migration.
When a task migrates to another CPU, the statistics that the hardware/driver/policy
were keeping for that target CPU are no longer valid in terms of forward-looking
predictive power. A communication channel (API or notification or whatever form
it takes) around this would be quite helpful.
This could be as simple as setting a flag on the target cpu (in its rq), so that
at the next power event (exiting idle, P-state evaluation, whatever) the policy
code can flush and start over.
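As a rough illustration of that flag idea (everything here is hypothetical: the
struct, function names and call sites are made up for the example, not existing
kernel or cpufreq APIs), the scheduler would set a per-cpu "stale" flag on
migration and the policy code would flush its history the next time it runs on
that cpu:

#include <stdbool.h>
#include <stdio.h>

#define NR_CPUS 8

/* Hypothetical per-cpu state a P-state/idle governor might keep. */
struct power_stats {
	unsigned long recent_load;	/* whatever history the policy tracks */
	bool stats_stale;		/* set when a task migrated in */
};

static struct power_stats per_cpu_stats[NR_CPUS];

/* Hypothetically called by the scheduler when a task moves to dst_cpu. */
static void notify_task_migrated(int dst_cpu)
{
	per_cpu_stats[dst_cpu].stats_stale = true;
}

/* Called at the next power event (idle exit, P-state evaluation, ...). */
static void power_event(int cpu)
{
	struct power_stats *ps = &per_cpu_stats[cpu];

	if (ps->stats_stale) {
		ps->recent_load = 0;	/* flush and start over */
		ps->stats_stale = false;
	}
	/* ... normal predictive policy runs from here ... */
}

int main(void)
{
	notify_task_migrated(3);
	power_event(3);
	printf("cpu 3 history flushed, stale flag now %d\n",
	       per_cpu_stats[3].stats_stale);
	return 0;
}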


On thinking more about the short-running task issue: there is an optimization we
currently don't do, mostly for hyperthreading (and HT is just one out of a set of
cases with similar power behavior).
If we know a task runs briefly AND is not performance critical, it's much better
to place it on a hyperthreading buddy of an already busy core than to place it on
an empty core (or to delay it).
Yes, an HT pair doesn't give the same performance as a full core, but in terms of
power the second half of an HT pair is nearly free... so if there's a task that's
not performance sensitive (and won't disturb the other task too much, e.g. runs
briefly enough), it's better to pack onto a core than to spread.
You can generalize this to a class of systems where adding work to a core (read:
group of cpus that share resources) is significantly cheaper than running on a
fully empty core.

(There is clearly a tradeoff: by sharing resources you also end up reducing
performance/efficiency, and that has its own effect on power, so some kind of
balance is needed and the gain has to be big enough to be worth the loss.)
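A toy sketch of that placement preference, assuming an invented topology where
cpu N and cpu N^1 are HT siblings (the busy array and the helper are made up for
illustration, not scheduler code): prefer an idle sibling of a busy core for a
short, non-critical task, and only fall back to an empty core if no such sibling
exists.

#include <stdbool.h>
#include <stdio.h>

#define NR_CPUS 8	/* assume 4 cores x 2 HT siblings; sibling of N is N^1 */

static bool cpu_busy[NR_CPUS] = { true, false, false, false,
				  true, false, false, false };

/*
 * For a short, non-performance-critical task, prefer an idle HT sibling
 * of an already busy core: the second half of an HT pair is nearly free
 * in power terms.  Otherwise fall back to any idle cpu (an empty core).
 */
static int pick_cpu_for_short_task(void)
{
	int cpu;

	for (cpu = 0; cpu < NR_CPUS; cpu++)
		if (!cpu_busy[cpu] && cpu_busy[cpu ^ 1])
			return cpu;	/* idle sibling of a busy core */

	for (cpu = 0; cpu < NR_CPUS; cpu++)
		if (!cpu_busy[cpu])
			return cpu;	/* empty core as a fallback */

	return -1;
}

int main(void)
{
	/* With cpu 0 and cpu 4 busy, this picks cpu 1 (sibling of cpu 0). */
	printf("short task goes to cpu %d\n", pick_cpu_for_short_task());
	return 0;
}

A real policy would of course gate this on the performance/efficiency tradeoff
mentioned above rather than always preferring the sibling.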
