Re: [RFC PATCH v1 0/3] Scaled statistics using APERF/MPERF in x86

From: Arjan van de Ven
Date: Mon May 26 2008 - 14:52:57 EST


On Tue, 27 May 2008 00:06:56 +0530
Balbir Singh <balbir@xxxxxxxxxxxxxxxxxx> wrote:

> Arjan van de Ven wrote:
> > On Mon, 26 May 2008 22:54:43 +0530
> > Balbir Singh <balbir@xxxxxxxxxxxxxxxxxx> wrote:
> >
> >> Arjan,
> >>
> >>
> >> These problems exist anyway, irrespective of scaled accounting (I'd
> >> say that they are exceptions)
> >>
> >> 1. The management tool does have access to the current frequency
> >> and maximum frequency, irrespective of scaled accounting. The
> >> decision could still be taken on the data that is already
> >> available and management tools can already use them
> >
> > it's sadly not as easy as you make it sound. From everything you
> > wrote you're making the assumption "if we're not at maximum
> > frequency, we have room to spare", which is very much not a correct
> > assumption
> >
>
> That's true in general. If the CPUs are throttled due to overheating,
> the system management application will figure out that it cannot
> change the frequency.

It's not the system management application but the kernel (and the
hardware, especially in the case of IDA) that manages the frequency.

> How do I interpret my CPU frequency applet's
> data when it says that the system is running at 46%?

That is a very good question. The answer is "uhh badly". Sad but true.
I'm not arguing against that.
The problem I have is that what you're doing does not make it better!

>
> >> 2. With IDA, we'd have to
> >> document that APERF/MPERF can be greater than 100% if the system is
> >> overclocked.
> >>
> >> Scaled accounting only intends to provide data already available.
> >> Interpretation is left to management tools and we'll document the
> >> corner cases that you just mentioned.
> >
> > IDA is not overclocking, nor is it a corner case *at all*. It's the
> > common case in fact on more modern systems. Having the kernel
> > present "raw" data to applications that then have no idea how to
> > really use it isn't, to be honest, very attractive to me as an idea:
> > you're presenting a very raw hardware interface that will keep
> > changing over time in terms of how to interpret the data... the
> > kernel needs to abstract such hard stuff from applications, not
> > fully expose them to it. Especially since these things *ARE* tricky
> > and *WILL* change. Future x86 hardware will have behavior that
> > makes the "oh we'll document the corner cases" approach extremely
> > impractical. Heck, even today's hardware (but arguably not yet the
> > server hardware) behaves like that. "Documenting the common case as
> > corner case" is not the right thing to do when introducing some new
> > behavior/interface. Sorry.
>
> Before I argue against that, I would like to ask
>
> 1. How are APERF/MPERF meant to be utilized?

They're meant to be used by the cpu frequency governors to figure out
how many cycles were actually executed (especially needed with IDA).

> 2. The CPU frequency driver/governor uses APERF/MPERF as well - we
> could argue and say that it should not be using/exposing that data to
> user space or using that data to make decisions.

That's a case where it really makes sense: the thing that controls the
cpu P-state learns how much work was actually done, so it can
reevaluate what the cpu frequency should be going forward. E.g. it
looks at the actual frequency (APERF/MPERF) to see what's useful to
set next.
IDA makes all of this necessary, because it turns "frequency" into a
dynamic concept.
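
To make the APERF/MPERF ratio concrete, here is a minimal sketch of how
an average effective frequency could be derived from two counter
snapshots. It assumes APERF advances at the actual clock rate and MPERF
at a fixed reference rate while the cpu is not idle. The struct, the
function name and the ref_khz parameter are hypothetical and are not
taken from the patches or from any kernel interface.

```c
#include <stdint.h>

/* Hypothetical snapshot of the two counters; the names are illustrative. */
struct perf_sample {
	uint64_t aperf;	/* assumed to count at the actual running frequency */
	uint64_t mperf;	/* assumed to count at a fixed reference frequency */
};

/*
 * Average effective frequency in kHz between two snapshots.
 * ref_khz is the reference frequency that MPERF is assumed to tick at.
 * Returns 0 when MPERF did not advance, so there is nothing to divide by.
 * The multiplication can overflow for very long intervals; this sketch
 * assumes short sampling periods.
 */
static uint64_t effective_khz(struct perf_sample prev,
			      struct perf_sample cur,
			      uint64_t ref_khz)
{
	uint64_t d_aperf = cur.aperf - prev.aperf;
	uint64_t d_mperf = cur.mperf - prev.mperf;

	if (d_mperf == 0)
		return 0;

	return ref_khz * d_aperf / d_mperf;
}
```

With a reference of 2000000 kHz, deltas of 500/1000 give 1000000 kHz,
while deltas of 1200/1000 give 2400000 kHz, i.e. above the reference,
which is the situation IDA creates.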

> 3. How do I answer the following problem
>
> My CPU utilization is 50% at all frequencies (since utilization is
> time based), does it mean that frequency scaling does not impact my
> workload?

Without knowing anything other than this, then yes, that would be a
logical conclusion: the most likely cause would be that your cpu is
memory bound. In fact, you could then scale your cpu frequency/voltage
down and save some power without losing performance.
It's a weird workload though; it's probably a time-based thing where you
alternate between idle and fully memory bound loads.

(which is another case where your patches would then expose idle time
even though your cpu is fully utilized for the 50% of the time it's
running)
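
A rough illustration of that last point, as a sketch only: it assumes
the scaled figure is simply busy time multiplied by the APERF/MPERF
ratio, which is an assumption about the proposal and not code from the
patches. Both function names are hypothetical.

```c
#include <stdint.h>

/* Time-based utilization in percent: busy time over wall-clock time. */
static unsigned int time_util_pct(uint64_t busy_ns, uint64_t wall_ns)
{
	if (wall_ns == 0)
		return 0;
	return (unsigned int)(busy_ns * 100 / wall_ns);
}

/*
 * Hypothetical scaled utilization: the time-based figure multiplied by
 * the APERF/MPERF ratio over the same interval. Returns 0 when either
 * denominator is zero.
 */
static unsigned int scaled_util_pct(uint64_t busy_ns, uint64_t wall_ns,
				    uint64_t d_aperf, uint64_t d_mperf)
{
	if (wall_ns == 0 || d_mperf == 0)
		return 0;
	return (unsigned int)(busy_ns * 100 * d_aperf / (wall_ns * d_mperf));
}
```

For a cpu that is busy half the time while running at half the
reference frequency, time_util_pct(500, 1000) is 50 but
scaled_util_pct(500, 1000, 1, 2) is 25. The scaled number suggests 75%
headroom, yet a memory bound workload would not get faster at a higher
frequency, so the extra "idle" it reports is not really there.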



--
If you want to reach me at my work email, use arjan@xxxxxxxxxxxxxxx
For development, discussion and tips for power savings,
visit http://www.lesswatts.org