Re: Power Management framework proposal

From: david
Date: Sun Jul 22 2007 - 23:53:27 EST


On Sun, 22 Jul 2007, Arjan van de Ven wrote:

On Sun, 2007-07-22 at 11:56 -0700, david@xxxxxxx wrote:

I have a concern with this approach though. It seems to assume that
there is one global thing somewhere that sets the system state; in my
experience that is the wrong approach; in fact there is very definite
evidence that there are many decisions on power that are to be made
locally at a high frequency. An example of this is the processor speed;
the ondemand governor does exactly this for the cpus that can switch
speeds fast; it's just impossible to beat such a local, fast decision
with anything on a global scale.

the intent was not to have one global call that sets the mode on all
devices, but rather have one call for each device/subsystem, just the same
call in each case.

there's also nothing that says that there can only be one thing setting
the mode (although that does mean a fourth call 'report_current_mode()' or
similar is needed). and if you choose to have two pieces of software
managing the same device things could get 'interesting'.

as for the speed at which such decisions need to be made:

this API is not saying anything about the speed of the decisions.
it's also not saying anything about whether the decision making is being done
by kernelspace or userspace. it's just providing a common way for whatever
software is doing the decision making to find out its options and set
the modes.


but it makes for a layer between the device and the setting of the
modes.. which sort of would defeat the option of having things truly
local.

Settings don't mean much in general (in specific cases, maybe), it's the
requirements that matter. The *intent* matters. Linus forced this into
cpufreq way back, and while I and perhaps others thought he was just
being silly, 6 years later it turns out he was absolutely right.

and the more I am seeing of cpufreq the more it looks like what I'm proposing, so I'm glad to see that it's a good model :-)

Maybe something else
A power policy management framework doesn't need a unified framework (I
know this for a fact, I'm hoping to release the code within a few
weeks). A unified interface doesn't even help one single bit: the
semantics of each part are *extremely* different even if you make it look
the same; the sameness is only cosmetic.

The consequences of managing a disk vs managing a cpu vs managing the
LCD brightness via the X server are all very different. The tradeoffs
you need to make are all very different. The things you want to control
are all very different. Trying to force a standard interface makes the
interface for a specific subsystem go away from the *actual* best
interface for that subsystem, for no gain since the thing that manages
the policy needs to have different parts for each *anyway*.

Ok, I can see that if things really are different then it's worth doing different things to control them.

however, let me go back to my original post on the subject here

right now drivers are supposed to have (forgive me if I get the function names wrong)

initialize()
shutdown()
suspend()
suspend_late()
resume()
resume_early()

with suspend taking one of several parameters
PM_EVENT_SUSPEND
PM_EVENT_FREEZE
PM_EVENT_PRETHAW

and the notes say that what is supposed to happen is fairly undefined because different things can have vastly different capabilities. so to really control the device you need other, per-driver interfaces as well.

this API is driven by the activities that the suspend process is currently designed to use, and each routine assumes a given existing state; if you call it when in any other state the results are undefined.

any match to the actual capabilities of the hardware is purely coincidental. to have any ability to control the mode of anything at runtime requires that the code doing so must have specific knowledge of the driver in question.


compare this underdefined mess to the sanity that cpufreq gives you for controlling different vendors' CPUs with their different capabilities.

with cpufreq you somewhere have a table that goes something along the lines of

freq     voltage
2.0GHz   3.0v
1.5GHz   3.0v
1.0GHz   1.5v
500MHz   0.8v

and a function that lets you select the freq you want

if cpufreq were to switch over to the API I'm suggesting the table would change to

mode  capacity  power
0     0         0
1     100       100
2     75        100    (or possibly 95, there is some benefit to a slower clock at the same voltage)
3     50        25
4     25        7

so it would be a relatively minor change, probably causing more disruption than benefit to change in and of itself.
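
as a rough illustration of how small the change is, the second table can be derived mechanically from the first. this is only a sketch in python: the function name is made up, and it assumes capacity is the clock as a percentage of the fastest clock and that power tracks the square of the voltage (which is what the numbers above happen to work out to). it is not how cpufreq does anything today.

```python
# Sketch only: derive a mode/capacity/power table from a cpufreq-style
# frequency/voltage table. Assumptions (not taken from cpufreq itself):
#   - capacity is the clock as a percentage of the fastest clock
#   - power scales with the square of the voltage, ignoring the clock
#   - mode 0 is "off" and costs nothing

FREQ_TABLE = [  # (frequency in MHz, core voltage in volts), fastest first
    (2000, 3.0),
    (1500, 3.0),
    (1000, 1.5),
    (500, 0.8),
]


def build_mode_table(freq_table):
    """Return a list of (mode, capacity, power) tuples, mode 0 being off."""
    max_freq = max(freq for freq, _ in freq_table)
    max_volt = max(volt for _, volt in freq_table)
    modes = [(0, 0, 0)]
    for mode, (freq, volt) in enumerate(freq_table, start=1):
        capacity = round(100 * freq / max_freq)
        power = round(100 * (volt / max_volt) ** 2)
        modes.append((mode, capacity, power))
    return modes


MODE_TABLE = build_mode_table(FREQ_TABLE)
```

with the four frequency/voltage pairs above this produces the same five rows as the mode table, with 100 rather than 95 for mode 2 since the clock is ignored in the power estimate.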

also, other than efficiency arguments, there's nothing that says the modes must be integers rather than strings. instead of 0-4 above you could use the entries from under freq in the first table.

I don't know how cpufreq handles a cpu with logic blocks that can be turned off individually but with the type of API I'm talking about you could easily have

mode  capacity  power
0     0         0
1     100       100    (full clock, both blocks on)
2     50        60     (full clock, one block off)
3     50        25     (half clock, half voltage, both blocks on)
4     25        15     (half clock, half voltage, one block off)
5     25        7      (quarter clock, quarter voltage, both blocks on)
6     12        4      (quarter clock, quarter voltage, one block off)
7     0         1      (clock stopped, but chip still energized, faster to wake up from than mode 0)

with the benefits of mode 2 vs 3, 4 vs 5, and 7 vs 0 showing up in the transition cost matrix, where it would show that it's faster to go up to the high-capacity modes from the first of each set than from the second, even though there are power saving advantages to the second in each set.
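
to make the role of the transition costs concrete, here is a small python sketch of how whatever is making the decisions could use them. the capacity and power numbers are the ones from the table above; the cost column (cost of getting back to mode 1, in arbitrary units) is invented purely for illustration, chosen only so that 2 is cheaper than 3, 4 cheaper than 5, and 7 cheaper than 0.

```python
# Sketch only: pick a mode from the eight-entry table above.
# MODES comes from that table; COST_TO_FULL is an invented column of a
# transition cost matrix (arbitrary units, cost of going to mode 1).

MODES = {  # mode: (capacity, power)
    0: (0, 0),
    1: (100, 100),
    2: (50, 60),
    3: (50, 25),
    4: (25, 15),
    5: (25, 7),
    6: (12, 4),
    7: (0, 1),
}

COST_TO_FULL = {0: 100, 1: 0, 2: 1, 3: 5, 4: 6, 5: 10, 6: 11, 7: 20}


def pick_mode(min_capacity, max_cost_to_full=None):
    """Lowest-power mode with enough capacity and a cheap enough way back up."""
    candidates = [
        mode
        for mode, (capacity, _power) in MODES.items()
        if capacity >= min_capacity
        and (max_cost_to_full is None or COST_TO_FULL[mode] <= max_cost_to_full)
    ]
    if not candidates:
        return None
    # Lowest power wins; ties go to the mode that is cheaper to leave.
    return min(candidates, key=lambda m: (MODES[m][1], COST_TO_FULL[m]))
```

asking for a capacity of 50 with no limit on the wake-up cost gives mode 3, while the same request with a tight limit (2 in these made-up units) gives mode 2, which is exactly the tradeoff between the members of each set.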

but the idea of adding the cpu control to this API was an afterthought. the biggest thing was to get something better than the current mess for other devices, and the fact that cpufreq was initially seen as a waste of time, but now you are seeing its value, could be an argument to do a similar transition for the power modes of other devices as well.

Now I realize that the needs for "hard small embedded" are different
from "PC like", and to be honest, I don't think it's entirely possible
to unify them; I don't think it's even worthwhile to pursue that (look
at where those attempts have gotten us so far)... but I suspect even in
the small embedded space a standard, forced and thus unnatural interface
isn't what is needed.

I am thinking that a standard way to define the available modes of operation of a piece of hardware is an advantage at all scales. even if the generic API doesn't quite cover every possible mode (if you have enough knobs to twist, the combinatorial explosion of the possible modes may mean that you don't actually implement all of them), making it possible for software to discover and set the modes for different devices without having to know specifics of the drivers would be a good thing.

you mention LCD backlights as an example of something non-standard enough to create a new interface for. I think it would fit the API I'm proposing quite nicely

example 1: a laptop screen

mode  capacity  power  description
0     0         0      off
1     100       100    full brightness
2     70        60     half power to the backlight
3     50        35     quarter power to the backlight
4     30        25     eighth power to the backlight
5     5         10     backlight off

example 2: a front-panel display on a server (no variable backlight control)

mode  capacity  power  description
0     0         0      off
1     100       100    backlight on
2     50        10     backlight off

unless the device had a light sensor with it I wouldn't expect these settings to be changed automatically, but this API would make it trivial for userspace tools to control the brightness of any display with no driver-specific code: they would just look for display type objects, read the capabilities, and change the modes as the user requests.
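
a sketch of what "no driver-specific code" could look like, in python. the two tables are the examples above; the helper name closest_mode() and the policy (nearest capacity wins, ties go to the lower-power mode, mode 0 only when the user asks for 0) are my own assumptions, not part of the proposal.

```python
# Sketch only: one routine that can set the brightness of any display,
# given nothing but the mode table the driver publishes. The tables are
# the two examples above; closest_mode() is a made-up helper name.

LAPTOP_SCREEN = [  # (mode, capacity, power)
    (0, 0, 0), (1, 100, 100), (2, 70, 60), (3, 50, 35), (4, 30, 25), (5, 5, 10),
]
FRONT_PANEL = [(0, 0, 0), (1, 100, 100), (2, 50, 10)]


def closest_mode(mode_table, wanted_capacity):
    """Return the mode whose capacity is nearest the requested brightness.

    Mode 0 (off) is only chosen when the user asks for 0; ties go to the
    mode that uses less power.
    """
    if wanted_capacity <= 0:
        return 0
    lit = [row for row in mode_table if row[0] != 0]
    mode, _capacity, _power = min(
        lit, key=lambda row: (abs(row[1] - wanted_capacity), row[2])
    )
    return mode
```

the same call works on both tables: the tool never needs to know that one device has a variable backlight and the other only has on and off.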

currently it would probably take two different software packages to control the backlights on these two devices, one that understands the video display driver (and would probably be pretty specific to that driver) and a second one that would understand the front-panel display driver.

with the current situation it's practically impossible to create a tool that allows you to set the power saving modes for everything in a system. that tool would need to know the ins and outs of every driver, and keep up to date on driver changes.

and the flip side of this is that it's also very hard to get the power saving features of a new device handled in an appropriate manner. you not only need to write the capabilities into the driver, you have to write a utility to control those capabilities, and then try to get similar software included in all the system utilities that you would want to use to control those capabilities.

with the approach I'm proposing creating such a tool would be fairly simple, it would walk the sysfs tree to see what hardware is there, read what modes it can be set in (including flags that tell you that things below it need to be in modes with specific capabilities if appropriate) and let you change them.
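
as a sketch of how little such a tool would need, here is the discovery half in python. the attribute name "power_modes" and its one-mode-per-line "mode capacity power" format are invented for this example (nothing like them exists in sysfs today), and the flags about what the devices below need are left out entirely.

```python
# Sketch only: walk a sysfs-like tree and collect every device that
# publishes a mode table. The attribute name "power_modes" and its
# "mode capacity power" line format are invented for this example;
# nothing like them exists in sysfs today.
import os


def parse_modes(text):
    """Parse lines of 'mode capacity power' into a list of int tuples."""
    modes = []
    for line in text.splitlines():
        fields = line.split()
        # Skip header or malformed lines; keep only numeric rows.
        if len(fields) >= 3 and all(f.isdigit() for f in fields[:3]):
            modes.append(tuple(int(f) for f in fields[:3]))
    return modes


def find_mode_tables(root="/sys/devices", attr="power_modes"):
    """Map each device directory under root to its parsed mode table."""
    tables = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        if attr in filenames:
            try:
                with open(os.path.join(dirpath, attr)) as handle:
                    tables[dirpath] = parse_modes(handle.read())
            except OSError:
                continue  # unreadable attribute: skip this device
    return tables
```

changing a mode would then be a write to a similar (equally hypothetical) attribute in the same directory, with no knowledge of which driver sits behind it.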

if you don't want to make the shift with cpufreq, that's fine. it sounds like you are at least 90% of the way there anyway, it's not that big a deal, but do you think that there's value in replacing the current ad-hoc approach with something more structured (even if it's not this proposal)?

David Lang

