Re: [RFC] Energy/power monitoring within the kernel

From: Pawel Moll
Date: Wed Oct 24 2012 - 12:51:43 EST


On Wed, 2012-10-24 at 01:40 +0100, Thomas Renninger wrote:
> > More and more of people are getting interested in the subject of power
> > (energy) consumption monitoring. We have some external tools like
> > "battery simulators", energy probes etc., but some targets can measure
> > their power usage on their own.
> >
> > Traditionally such data should be exposed to the user via hwmon sysfs
> > interface, and that's exactly what I did for "my" platform - I have
> > a /sys/class/hwmon/hwmon*/device/energy*_input and this was good
> > enough to draw pretty graphs in userspace. Everyone was happy...
> >
> > Now I am getting new requests to do more with this data. In particular
> > I'm asked how to add such information to ftrace/perf output.
> Why? What is the gain?
>
> Perf events can be triggered at any point in the kernel.
> A cpufreq event is triggered when the frequency gets changed.
> CPU idle events are triggered when the kernel requests to enter an idle state
> or exits one.
>
> When would you trigger a thermal or a power event?
> There is the possibility of (critical) thermal limits.
> But if I understand this correctly you want this for debugging and
> I guess you have everything interesting one can do with temperature
> values:
> - read the temperature
> - draw some nice graphs from the results
>
> Hm, I guess I know what you want to do:
> In your temperature/energy graph, you want to have some dots
> when relevant HW states (frequency, sleep states, DDR power,...)
> changed. Then you are able to see the effects over a timeline.
>
> So you have to bring the existing frequency/idle perf events together
> with temperature readings
>
> Cleanest solution could be to enhance the exisiting userspace apps
> (pytimechart/perf timechart) and let them add another line
> (temperature/energy), but the data would not come from perf, but
> from sysfs/hwmon.
> Not sure whether this works out with the timechart tools.
> Anyway, this sounds like a userspace only problem.

Ok, so it is actually what I'm working on right now. Not with the
standard perf tool (there are other users of that API ;-) but indeed I'm
trying to "enrich" the data stream coming from kernel with user-space
originating values. I am a little bit concerned about effect of extra
syscalls (accessing the value and gettimeofday to generate a timestamp)
at a higher sampling rates, but most likely it won't be a problem. Can
report once I know more, if this is of interest to anyone.

Anyway, there are at least two debug/trace related use cases that can
not be satisfied that way (of course one could argue about their
usefulness):

1. ftrace-over-network (https://lwn.net/Articles/410200/) which is
particularly appealing for "embedded users", where there's virtually no
useful userspace available (think Android). Here a (functional) trace
event is embedded into a normal trace and available "for free" at the
host side.

2. perf groups - the general idea is that one event (let it be cycle
counter interrupt or even a timer) triggers read of other values (eg.
cache counter or - in this case - energy counter). The aim is to have a
regular "snapshots" of the system state. I'm not sure if the standard
perf tool can do this, but I do :-)

And last, but not least, there are the non-debug/trace clients for
energy data as discussed in other mails in this thread. Of course the
trace event won't really satisfy their needs either.

Thanks for your feedback!

PaweÅ


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/