Re: [PATCH 11/15] sched: Pass unlimited __cpu_power information to upper domain level groups

From: Peter Zijlstra
Date: Tue Aug 25 2009 - 03:12:21 EST

On Mon, 2009-08-24 at 23:49 +0530, Balbir Singh wrote:

> That reminds me, accounting is currently broken and should be based on
> APERF/MPERF (Power gets it right - based on SPURR).

What accounting?

> > The trouble is that cpu_power is now abused for placement decisions too,
> > and that needs to be taken out.
> OK.. so you propose extending the static cpu_power to dynamic
> cpu_power but based on current topology?

Right, so cpu_power is primarily used to normalize domain weight in the
load-balancer.

Suppose a 4 core machine with 1 unplugged core:


  0,1,3

  0,1  3

The sd-0,1 will have cpu_power 2048, while the sd-3 will have 1024; this
allows find_busiest_group() for sd-0,1,3 to pick the one which is
relatively most overloaded.

Supposing 3, 2, 2 (nice0) tasks on these cores, the domain weight of
sd-0,1 is 5*1024 and that of sd-3 is 2*1024. Normalized, that becomes
5/2 and 2 respectively, which clearly shows sd-0,1 to be the busiest of
the pair.
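
To make that arithmetic concrete, here is a minimal sketch of the
normalization. It is not the actual find_busiest_group() code: the helper
name normalized_load() and the LOAD_SCALE constant are assumptions for
illustration, and it assumes one nice-0 task weighs 1024 and one full core
has cpu_power 1024.

```c
/* Assumed scale: weight of one nice-0 task, and cpu_power of one full core. */
#define LOAD_SCALE 1024UL

/*
 * Hypothetical helper: divide a group's summed task weight by its
 * cpu_power so that groups spanning different numbers of cores can be
 * compared. The result is expressed in units of LOAD_SCALE.
 */
static unsigned long normalized_load(unsigned long group_weight,
				     unsigned long cpu_power)
{
	return group_weight * LOAD_SCALE / cpu_power;
}
```

With the numbers above, normalized_load(5 * 1024, 2048) yields 2560 (that
is, 5/2 in units of 1024) and normalized_load(2 * 1024, 1024) yields 2048,
so sd-0,1 comes out as the busier group.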

Now back in the days Nick wrote all this, he did the cpu_power hack for
SMT which sets the combined cpu_power of 2 threads (that's all we had
back then) to 1024, because two threads share 1 core, and are roughly as
fast as a single core.

He then also used this to influence task placement, preferring to move
tasks to another sibling domain before getting the second thread active.
This worked.

Then multi-core with shared caches came along and people did the same
trick for mc power save in order to get that placement stuff, but that
horribly broke the load-balancer normalization.

Now comes multi-node, and people asking for more elaborate placement
strategies and all this starts creaking like a ghost house about to
collapse.

Therefore I want cpu_power back to load normalization only, and do the
placement stuff with something else.

Once cpu_power is pure again, we can start making it dynamic: for SMT we
can utilize APERF/MPERF to guesstimate the actual work capacity of
threads, and scale cpu_power back based on RT time used on the cpu.
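
As a rough illustration of the kind of scaling meant here, the sketch below
shrinks a cpu_power value by an APERF/MPERF delta ratio and by the fraction
of a sample period spent running RT tasks. Both helper names and all of
their parameters are hypothetical; reading the MSRs and the runqueue's RT
accounting is not shown, and the deltas are assumed small enough that the
multiplication does not overflow 64 bits.

```c
#include <stdint.h>

/*
 * Hypothetical: scale power by the ratio of the APERF and MPERF deltas
 * observed over one sample window. A zero MPERF delta leaves power as is.
 */
static uint64_t scale_by_aperf_mperf(uint64_t power, uint64_t aperf_delta,
				     uint64_t mperf_delta)
{
	if (mperf_delta == 0)
		return power;
	return power * aperf_delta / mperf_delta;
}

/*
 * Hypothetical: scale power by the share of the period that was NOT
 * consumed by RT tasks. rt_ns is clamped to period_ns.
 */
static uint64_t scale_by_rt_time(uint64_t power, uint64_t period_ns,
				 uint64_t rt_ns)
{
	if (period_ns == 0)
		return power;
	if (rt_ns > period_ns)
		rt_ns = period_ns;
	return power * (period_ns - rt_ns) / period_ns;
}
```

For example, a cpu with cpu_power 1024 that spent a quarter of the period
in RT tasks would be left with 768 for normal load-balancing purposes.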

Then when we walk the domain tree for load-balancing we re-do the
cpu_power sum, etc.
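
A toy version of re-doing the sum while walking up the tree could look
like the following. The struct toy_domain type and resum_cpu_power() are
invented for this sketch and do not correspond to the kernel's sched_domain
or sched_group structures; leaves are assumed to already carry their
(possibly dynamically scaled) cpu_power.

```c
#include <stddef.h>

#define TOY_MAX_CHILDREN 4

/* Hypothetical stand-in for a domain level; not a kernel structure. */
struct toy_domain {
	unsigned long cpu_power;
	struct toy_domain *child[TOY_MAX_CHILDREN];
	size_t nr_children;
};

/*
 * Recompute cpu_power bottom-up: a leaf keeps its own value, every other
 * level becomes the sum of its children. Returns the level's cpu_power.
 */
static unsigned long resum_cpu_power(struct toy_domain *d)
{
	unsigned long sum = 0;
	size_t i;

	if (d->nr_children == 0)
		return d->cpu_power;

	for (i = 0; i < d->nr_children; i++)
		sum += resum_cpu_power(d->child[i]);

	d->cpu_power = sum;
	return sum;
}
```

Applied to the 0,1 / 3 example with every core at 1024, this gives 2048
for sd-0,1, 1024 for sd-3 and 3072 for sd-0,1,3.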

