Re: [discussion]sched: a rough proposal to enable power saving in scheduler

From: Arjan van de Ven
Date: Sat Aug 18 2012 - 10:52:45 EST


On 8/18/2012 7:33 AM, Luming Yu wrote:
> saving mode. But obviously, we need to spread as much as possible
> across all cores in another socket(to race to idle). So from the
> example above, we see a threshold that we need to reference before
> selecting one from two complete different policy: spread or not
> spread... As long as there is hardware limitation, we could always
> need knob like that referenced threshold to adapt on different
> hardware in one kernel....

I think the physics are slightly simpler, if you abstract it one level.

every reasonable system out there has things that can be off if all cores are in the deep power state,
but that have to be on if even one of them is alive. On "big core" Intel, that's uncore and memory controller,
on small core (atom/phone) Intel that is the chipset fabric only. On ARM it might be something else. On all of
them it's some clocks, PLLs, voltage regulators etc etc.

not all chips are advanced enough to aggressively turn these things off when they could, but most are nowadays.

so in abstract, there's a power offset that gets you from 0 to 1 active cores. Let's call this P0
there is also a power offset to go from 1 to 2, but that's smaller than 0->1. Let's call this Pc

or rather, 0->1 has the same kind of offset as 1->2 plus some extra offset.. so P0 = Pbase + Pc

there's also an energy cost for waking a cpu up (and letting it go back to sleep afterwards)... call it Ewake

so the abstract question is
you're running a task A on cpu 0
you want to also run a task B, which you estimate to run for time T

it's more energy efficient to wake a 2nd cpu if

Ewake < T * Pbase

(this assumes all cores are the same, you get a more complex formula if that's not the case, where T is even core specific)
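
one way to read where that inequality comes from, assuming task A keeps cpu 0 busy for at least T: queueing B behind A on cpu 0 keeps everything powered for an extra T, which costs T * (Pbase + Pc); waking a 2nd cpu costs Ewake + T * Pc, because Pbase is already being paid for A. Subtract T * Pc from both sides and you get the condition above. A minimal sketch of that comparison follows; it is Python for illustration only, and the function names, units (seconds, watts, joules) and numbers are hypothetical, not scheduler code:

```python
def energy_serial(t_run, p_base, p_c):
    """Extra energy (J) if B runs on cpu 0 after A, keeping the package up T longer."""
    return t_run * (p_base + p_c)


def energy_parallel(t_run, p_c, e_wake):
    """Extra energy (J) if B runs on a freshly woken 2nd cpu while A is still running."""
    return e_wake + t_run * p_c


def should_wake_second_cpu(t_run, p_base, e_wake):
    """The rule from the text: wake a 2nd cpu when Ewake < T * Pbase."""
    return e_wake < t_run * p_base


# Made-up numbers: Pbase = 2 W, Pc = 1 W, Ewake = 0.01 J.
long_task = should_wake_second_cpu(t_run=0.010, p_base=2.0, e_wake=0.01)
short_task = should_wake_second_cpu(t_run=0.002, p_base=2.0, e_wake=0.01)
```

with these made-up numbers the 10 ms task justifies the wakeup and the 2 ms task does not; note that Pc drops out of the decision entirely.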


there is no hardware policy *switch* in such a formula, only parameters.
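If Pbase = 0 (e.g. your hardware has no extra power savings), the condition can never hold and the formula very naturally leads to one extreme of the behavior (never wake a 2nd cpu);
a very high Ewake pushes the same way, while an Ewake close to 0 leads to the other extreme (always wake one).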
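
the same condition can be rearranged into a break-even run time, T > Ewake / Pbase, which makes it easy to see how the parameters alone move the behavior. A small sketch, again in Python purely for illustration (the helper name and the sample values are assumptions, not anything in the kernel):

```python
import math


def break_even_runtime(e_wake, p_base):
    """Run time T (s) above which waking a 2nd cpu saves energy, from Ewake < T * Pbase."""
    if p_base <= 0:
        # No shared power to save, so the wakeup cost is never recovered.
        return math.inf
    return e_wake / p_base


never_pays_off = break_even_runtime(e_wake=0.01, p_base=0.0)   # infinite break-even
always_pays_off = break_even_runtime(e_wake=0.0, p_base=2.0)   # break-even at T = 0
in_between = break_even_runtime(e_wake=0.01, p_base=2.0)       # about 5 ms
```

any task expected to run longer than the returned value is worth a wakeup under this model; nothing in it is a mode switch, it is just the ratio of two hardware numbers.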

The only other variable is the user preference between power and performance balance.. but that's a pure preference, not hardware
specific anymore.



