Re: [discussion]sched: a rough proposal to enable power saving in scheduler

From: Peter Zijlstra
Date: Wed Aug 15 2012 - 07:05:51 EST


On Mon, 2012-08-13 at 20:21 +0800, Alex Shi wrote:
> Since there is no power saving consideration in the CFS scheduler, I have a
> very rough idea for enabling a new power saving scheme in CFS.

Adding Thomas, he always delights in poking holes in power schemes.

> It is based on the following assumptions:
> 1, If many tasks crowd the system, running them on the cpus of only a few
> domains while leaving the other cpus idle cannot save power. Letting all
> cpus take the load, finish the tasks early, and then go idle will save
> more power and give a better user experience.

I'm not sure this is a valid assumption. I've had it explained to me by
various people that race-to-idle isn't always the best thing. It has to
do with the cost of switching power states and the duration of execution
and other such things.
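
To make the trade-off concrete, here is a toy user-space model in C, not
kernel code. The struct, its fields and every number fed to it are
hypothetical; the sketch only illustrates that the answer depends on the
cost of switching power states and on how long shared hardware has to stay
awake.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical toy model; the parameters are illustrative, not measured. */
struct power_model {
    double active_mw;     /* power of one cpu while executing */
    double idle_mw;       /* power of one cpu in its idle state */
    double transition_mj; /* energy of one wake-up plus one return to idle */
    double shared_mw;     /* shared (package-level) power while any cpu runs */
};

/*
 * Energy in mJ spent during window_ms when work_ms of cpu time is split
 * evenly over ncpus of the total_cpus cpus; the remaining cpus stay idle.
 */
double window_energy_mj(const struct power_model *m, int total_cpus,
                        int ncpus, double work_ms, double window_ms)
{
    assert(m != NULL && ncpus >= 1 && ncpus <= total_cpus);
    double busy_ms = work_ms / ncpus; /* each busy cpu runs this long */
    assert(busy_ms <= window_ms);

    /* mW * ms gives uJ, hence the division by 1000 to get mJ */
    double busy_cpu = (m->active_mw * busy_ms +
                       m->idle_mw * (window_ms - busy_ms)) / 1000.0 +
                      m->transition_mj;
    double idle_cpu = m->idle_mw * window_ms / 1000.0;
    double shared = m->shared_mw * busy_ms / 1000.0;

    return ncpus * busy_cpu + (total_cpus - ncpus) * idle_cpu + shared;
}
```

In this model, running the work on one cpu is cheaper when transition_mj
dominates and shared_mw is zero, while spreading over all cpus (finishing
early) is cheaper once shared_mw is large enough. Neither strategy wins
unconditionally.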

> 2, The schedule domains and schedule groups perfectly match the hardware
> and its power consumption units. So, pulling tasks out of a domain means
> that this power consumption unit can potentially go idle.

I'm not sure I understand what you're saying, sorry.

> So, following what Peter mentioned in commit 8e7fbcbc22c (sched: Remove
> stale power aware scheduling), this proposal will adopt the
> sched_balance_policy concept and use 2 kinds of policy: performance, power.

Yay, ideally we'd also provide a 3rd option: auto, which simply switches
between the two based on AC/BAT, UPS status and simple things like that.
But this seems like a later concern, you have to have something to pick
between before you can pick :-)
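
A minimal sketch of what such an 'auto' option could look like, as plain
user-space C. The enum values, struct power_source and effective_policy()
are hypothetical names for illustration; how the AC/BAT and UPS state would
actually be obtained is not covered here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical policy values; only the policy concept comes from the thread. */
enum sched_balance_policy {
    POLICY_PERFORMANCE,
    POLICY_POWER,
    POLICY_AUTO,
};

/* Hypothetical snapshot of the platform's power source. */
struct power_source {
    bool on_ac;          /* running from mains */
    bool ups_on_battery; /* an attached UPS reports it is discharging */
};

/* Resolve 'auto' to one of the two real policies; pass the others through. */
enum sched_balance_policy
effective_policy(enum sched_balance_policy configured,
                 const struct power_source *src)
{
    assert(src != NULL);
    if (configured != POLICY_AUTO)
        return configured;
    if (!src->on_ac || src->ups_on_battery)
        return POLICY_POWER;
    return POLICY_PERFORMANCE;
}
```

The balancer itself would then only ever see performance or power, which
keeps 'auto' a thin layer on top of the two policies.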

> And in scheduling, 2 places will care about the policy: load_balance()
> and, at task fork/exec, select_task_rq_fair().

ack

> Here is some pseudo code that tries to explain the proposed behaviour in
> load_balance() and select_task_rq_fair();

Oh man.. A few words outlining the general idea would've been nice.

> load_balance() {
>     update_sd_lb_stats(); //get busiest group, idlest group data.
>
>     if (sd->nr_running > sd's capacity) {
>         //power saving policy is not suitable for
>         //this scenario, it runs like performance policy
>         mv tasks from busiest cpu in busiest group to
>         idlest cpu in idlest group;

Once upon a time we talked about adding a factor to the capacity for
this. So say you'd allow 2*capacity before overflowing and waking
another power group.
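
As a sketch of that factor, in user-space C: group_overflows() and its
overflow_pct parameter are hypothetical, and nr_running is used here only
to mirror the quoted pseudo code.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical overflow test: overflow_pct is an illustrative tunable,
 * where 100 means "overflow at capacity" and 200 means "allow 2*capacity".
 */
bool group_overflows(unsigned int nr_running, unsigned int capacity,
                     unsigned int overflow_pct)
{
    assert(overflow_pct >= 100);
    return nr_running * 100 > capacity * overflow_pct;
}
```

With overflow_pct at 200, a group of capacity 4 would hold up to 8 tasks
before another power group is woken.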

But I think we should not go on nr_running here, PJTs per-entity load
tracking stuff gives us much better measures -- also, repost that series
already Paul! :-)

Also, I'm not sure this is entirely correct, the thing you want to do
for power aware stuff is to minimize the number of active power domains,
this means you don't want idlest, you want least busy non-idle.
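
A sketch of "least busy non-idle" selection, again as a user-space toy.
struct group_stat, its abstract load units and pick_pack_target() are
hypothetical; skipping groups that are already full, and returning -1 so
the caller can decide to wake an idle group, are assumptions of this sketch.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-group summary in abstract load units. */
struct group_stat {
    unsigned int load;
    unsigned int capacity;
};

/*
 * Return the index of the least loaded group that is already active and
 * still has room, or -1 if every group is either idle or full.
 */
int pick_pack_target(const struct group_stat *g, size_t n)
{
    assert(g != NULL || n == 0);
    int best = -1;

    for (size_t i = 0; i < n; i++) {
        if (g[i].load == 0)
            continue; /* idle: leave its power domain asleep */
        if (g[i].load >= g[i].capacity)
            continue; /* full: no room for more */
        if (best < 0 || g[i].load < g[best].load)
            best = (int)i;
    }
    return best;
}
```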

>     } else {// the sd has enough capacity to hold all tasks.
>         if (sg->nr_running > sg's capacity) {
>             //imbalanced between groups
>             if (schedule policy == performance) {
>                 //when 2 busiest groups are at the same busy
>                 //degree, need to prefer the one that has
>                 // softest group??
>                 move tasks from busiest group to
>                 idlest group;

So I'd leave the currently implemented scheme as performance, and I
don't think the above describes the current state.

>             } else if (schedule policy == power)
>                 move tasks from busiest group to
>                 idlest group until busiest is just full
>                 of capacity.
>                 //the busiest group can balance
>                 //internally after next time LB,

There's another thing we need to do, and that is collect tasks in a
minimal amount of power domains. The old code (that got deleted) did
something like that, you can revive some of that code if needed -- I
just killed everything to be able to start with a clean slate.


>         } else {
>             //all groups have enough capacity for their tasks.
>             if (schedule policy == performance)
>                 //all tasks may have enough cpu
>                 //resources to run,
>                 //mv tasks from busiest to idlest group?
>                 //no, at this time, it's better to keep
>                 //the task on current cpu.
>                 //so, it is maybe better to do balance
>                 //in each of groups
>                 for_each_imbalance_groups()
>                     move tasks from busiest cpu to
>                     idlest cpu in each of groups;
>             else if (schedule policy == power) {
>                 if (no hard pin in idlest group)
>                     mv tasks from idlest group to
>                     busiest until busiest full.
>                 else
>                     mv unpin tasks to the biggest
>                     hard pin group.
>             }
>         }
>     }
> }

OK, so you only start to group later.. I think we can do better than
that.

>
> sub proposal:
> 1, Whether it is possible to balance tasks onto the idlest cpu rather than
> the appointed 'balance cpu'. If so, it may save one more round of balancing.
> The idlest cpu can preferably be the newly idle cpu, and is the least
> loaded cpu;
> 2, se or task load is good for setting the running time,
> but it should be the second basis in load balancing. The first basis of LB
> is the number of running tasks in the group/cpu, since whatever the weight
> of the groups is, if the number of tasks is less than the number of cpus,
> the group still has capacity to take more tasks. (will consider the SMT
> cpu power or other big/little cpu capacity on ARM.)

Ah, no we shouldn't balance on nr_running, but on the amount of time
consumed. Imagine two tasks being woken at the same time, both tasks
will only run a fraction of the available time, you don't want this to
exceed your capacity because, run back to back, the one cpu will still be
mostly idle.

What you want is to keep track of a per-cpu utilization level (inverse
of idle-time) and using PJTs per-task runnable avg see if placing the
new task on it will exceed the utilization limit.
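
A sketch of that check as user-space C. UTIL_SCALE, task_fits_cpu() and the
fixed-point representation are assumptions made for illustration; they are
not the interface of the per-entity load tracking series.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical fixed-point scale: 1024 means the cpu is busy 100% of the time. */
#define UTIL_SCALE 1024u

/*
 * Would placing a task with the given runnable average on this cpu keep
 * the cpu at or below util_limit? All three values are in UTIL_SCALE units.
 */
bool task_fits_cpu(unsigned int cpu_util, unsigned int task_runnable_avg,
                   unsigned int util_limit)
{
    assert(util_limit <= UTIL_SCALE);
    return cpu_util + task_runnable_avg <= util_limit;
}
```

Two tasks that each run about 10% of the time (roughly 102 units each) fit
on one cpu under an 80% limit (819 units), even though nr_running would
report 2 tasks for a capacity of 1.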

I think some of the Linaro people actually played around with this,
Vincent?

> unsolved issues:
> 1, like the current scheduler, it doesn't handle cpu affinity well in
> load_balance.

cpu affinity is always 'fun'.. while there's still a few fun sites in
the current load-balancer we do better than we did a while ago.

> 2, task groups aren't considered well in this rough proposal.

You mean the cgroup mess?

> It isn't well considered and may have mistakes. So I'm just sharing my
> ideas and hope they become better and workable through your comments and
> discussion.

Very simplistically the current scheme is a 'spread' the load scheme
(SD_PREFER_SIBLING if you will). We spread load to maximize per-task
cache and cpu power.

The power scheme should be a 'pack' scheme, where we minimize the active
power domains.

One way to implement this is to keep track of an active and
under-utilized power domain (the target) and fail the regular (pull)
load-balance for all cpus not in that domain. For the cpus that are in
that domain we'll have find_busiest select from all other under-utilized
domains pulling tasks to fill our target, once full, we pick a new
target, goto 1.
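
The loop described above can be sketched as a user-space toy. Everything
here is hypothetical: struct pdomain, the three functions, treating
utilization as a freely divisible quantity (real tasks are discrete),
picking the fullest under-utilized domain as target and pulling from the
emptiest one first. None of these choices is specified above.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical power domain: current utilization and its fill limit. */
struct pdomain {
    unsigned int util;
    unsigned int limit;
};

/* Target: an active, under-utilized domain (here: the fullest such one). */
int pick_target(const struct pdomain *d, size_t n)
{
    int t = -1;
    for (size_t i = 0; i < n; i++) {
        if (d[i].util == 0 || d[i].util >= d[i].limit)
            continue;
        if (t < 0 || d[i].util > d[t].util)
            t = (int)i;
    }
    return t;
}

/*
 * One pull attempt by a cpu in domain self. Returns the utilization
 * moved; 0 means the balance attempt "failed".
 */
unsigned int pack_pull(struct pdomain *d, size_t n, int self, int target)
{
    assert(d != NULL && target >= 0 && (size_t)target < n);
    if (self != target)
        return 0; /* cpus outside the target do not pull */

    int src = -1; /* emptiest other active, under-utilized domain */
    for (size_t i = 0; i < n; i++) {
        if ((int)i == target || d[i].util == 0 || d[i].util >= d[i].limit)
            continue;
        if (src < 0 || d[i].util < d[src].util)
            src = (int)i;
    }
    if (src < 0)
        return 0;

    unsigned int room = d[target].limit - d[target].util;
    unsigned int moved = d[src].util < room ? d[src].util : room;
    d[src].util -= moved;
    d[target].util += moved;
    return moved;
}

/* Repeat until nothing moves; returns how many domains remain active. */
size_t pack_all(struct pdomain *d, size_t n)
{
    for (;;) {
        int target = pick_target(d, n);
        if (target < 0 || pack_pull(d, n, target, target) == 0)
            break;
    }
    size_t active = 0;
    for (size_t i = 0; i < n; i++)
        if (d[i].util > 0)
            active++;
    return active;
}
```

Each successful pull either empties the source or fills the target, so the
loop terminates; what remains is a set of full domains plus at most one
partially filled one.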

