Re: [RFC v1] Tunable sched_mc_power_savings=n

From: Vaidyanathan Srinivasan
Date: Fri Jun 27 2008 - 02:23:55 EST


* Andi Kleen <andi@xxxxxxxxxxxxxx> [2008-06-27 00:38:53]:

> Peter Zijlstra wrote:
>
> >> And your workload manager could just nice processes. It should probably
> >> do that anyways to tell ondemand you don't need full frequency.
> >
> > Except that I want my nice 19 distcc processes to utilize as much cpu as
> > possible, but just not bother any other stuff I might be doing...
>
> They already won't do that if you run ondemand and cpufreq. It won't
> crank up the frequency for niced processes.

This may not provide the best power savings if the workload is bursty.
Finishing the job quickly and entering sleep states can have a better
impact. This is the race-to-idle problem, where we want to maximise
sleep state utilisation rather than reduce the frequency. The
benefit of this technique is certainly workload specific, and
even in this particular case, running at the lowest frequency is the
safest option for power savings from the OS point of view. For
maximum power savings, however, increasing sleep state utilisation has
the following advantages:

* Sleep states are per core, while voltage and frequency control apply
to multiple cores in a multi-core package. Hence freq change
decisions need to be taken at the package level. Though ondemand
makes the decision based on per-core utilisation and process
priority, the actual effect in hardware is the highest freq
recommended by all cores. The per-core decision is actually only
a recommendation or a vote.

* Moving tasks to fewer CPU packages in a multi-socket system
will provide maximum savings, since even shared resources on the idle
sockets can be in low power states.
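
To make the two points above concrete, here is a toy model in Python.
It is only a sketch: the function names, frequencies and topology are
made up for illustration and do not correspond to any kernel interface.

```python
# Toy model, all numbers hypothetical.

def package_freq(votes_mhz):
    """Effective package frequency: the highest per-core request wins."""
    return max(votes_mhz)

def idle_packages(placement):
    """Count packages in which every core is idle.

    placement is a list of packages; each package is a list of
    per-core task counts.
    """
    return sum(1 for cores in placement if all(n == 0 for n in cores))

# One busy core asks for 2400 MHz; the other three ask for 800 MHz.
votes = [800, 800, 2400, 800]
effective = package_freq(votes)

# Two tasks on a 2-socket system with 2 cores per socket.
spread = [[1, 0], [1, 0]]        # one task on each socket
consolidated = [[1, 1], [0, 0]]  # both tasks on socket 0
```

With these made-up numbers, effective is 2400 even though three of the
four cores voted for 800 MHz, and only the consolidated placement
leaves a whole socket idle.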

Multi socket systems with multi core CPUs have more controls for power
savings that were previously not available on single core systems.
Automatically making the right decision is an ideal solution. However
since there are trade-offs, we would like the users to experiment with
what suits them best. The rationale is similar to why we provide
different cpufreq governors and tunables.
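
One such trade-off is race-to-idle versus running slowly. A
back-of-the-envelope sketch in Python, with purely hypothetical power
figures (none of them measured on real hardware):

```python
def window_energy_j(work_mcycles, freq_mhz, active_w, sleep_w, window_s):
    """Energy in joules over a fixed window: run the work, then sleep."""
    busy_s = work_mcycles / freq_mhz
    if busy_s > window_s:
        raise ValueError("work does not fit in the window")
    return busy_s * active_w + (window_s - busy_s) * sleep_w

WORK = 4800     # millions of cycles per window (assumed)
WINDOW = 10.0   # seconds (assumed)

# Deep sleep state available (2 W while asleep).
fast_deep = window_energy_j(WORK, 2400, 30.0, 2.0, WINDOW)
slow_deep = window_energy_j(WORK, 800, 12.0, 2.0, WINDOW)

# Only a shallow sleep state available (10 W while asleep).
fast_shallow = window_energy_j(WORK, 2400, 30.0, 10.0, WINDOW)
slow_shallow = window_energy_j(WORK, 800, 12.0, 10.0, WINDOW)
```

With the deep sleep state the fast run costs 76 J against 80 J for the
slow one; with only the shallow state it costs 140 J against 112 J. So
which strategy wins depends on the platform and the workload.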

If we discover a good automatic technique to choose the right power
saving strategy that is widely acceptable, then certainly we will go
for it. Can we build a stepping stone to get there? Can we consider
these tunables as enablements for end users to try them out easily
and provide feedback?

>
> Extending that existing policy to socket load balancing would be only
> natural.

Consolidation based on task priority seems to be the challenge here.
However this is a good point. This is certainly a parameter for auto
tuning if only we can overcome the challenges in using priority for
task consolidation.

--Vaidy
