Re: [PATCH 3/3] perf/x86/rapl: Enable Baytrail/Braswell RAPL support

From: Thomas Gleixner
Date: Wed Sep 14 2016 - 12:19:14 EST


On Tue, 13 Sep 2016, Pan, Harry wrote:
> This things is because of the Baytrail/Braswell quirk breaks original
> assumption of perf RAPL polling timer rate calculation regarding of
> counter overflow case based on 200W;

ESU is the 'Energy Status Units' field in the MSR_RAPL_POWER_UNIT MSR.

ESU = (rdmsr(MSR_RAPL_POWER_UNIT) >> 8) & 0x1f;

So we have 5 bits of information and therefore:

0 <= ESU <= 31
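
As a minimal userspace sketch of that extraction (rapl_esu() is a made-up
helper name, and the raw MSR value is assumed to have been read elsewhere):

```c
#include <stdint.h>

/*
 * Hypothetical helper: extract the 5-bit ESU field (bits 12:8) from a
 * raw MSR_RAPL_POWER_UNIT value. Reading the MSR itself is not shown.
 */
static inline unsigned int rapl_esu(uint64_t power_unit_msr)
{
	return (unsigned int)((power_unit_msr >> 8) & 0x1f);
}
```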

The standard readout is:

joules = counter_value * mult;

mult = 1 / (2 ^ ESU)

The resulting multiplier is:

31 >= ESU >= 0
4.65661e-10J <= mult <= 1J

The scale function does:

val = counter << (32 - ESU);

which is converting the readout into units of

4.65661e-10J / 2 == 2.32830e-10J

because the shift is actually: (1 + (31 - ESU)).
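
A rough userspace sketch of that shift, for illustration only (the function
name is hypothetical and this is not a copy of the kernel's scale function):

```c
#include <stdint.h>

/*
 * Scale a raw 32-bit counter value to units of 2^-32 J, assuming the
 * standard encoding where one count is 1 / 2^ESU joules (0 <= esu <= 31).
 */
static inline uint64_t rapl_scale_std(uint32_t counter, unsigned int esu)
{
	return (uint64_t)counter << (32 - esu);
}
```

With ESU=16 a counter value of 65536 is exactly 1J, so the result is 1 << 32.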


The math for Baytrail/Braswell is:

microjoules = counter_value * mult

mult = 2 ^ ESU

The resulting multiplier is:

0 <= ESU <= 31
1 uJ <= mult <= 2.14748e+09 uJ
1e-6J <= mult <= 2147J

So now your Baytrail/Braswell quirk does:

ESU = 32 - ESU

so the scale function becomes:

val = counter << (32 - (32 - ESU))
==> val = counter << ESU

which is converting the readout to units of

1e-6J
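
To make the substitution concrete, a small sketch under the assumptions
above (both function names are invented for this example):

```c
#include <stdint.h>

/* The quirk as described above: store 32 - ESU instead of ESU. */
static inline unsigned int byt_quirk_hwunit(unsigned int esu)
{
	return 32 - esu;
}

/* Generic scale step: shift by (32 - hwunit), with 1 <= hwunit <= 32. */
static inline uint64_t rapl_scale(uint32_t counter, unsigned int hwunit)
{
	return (uint64_t)counter << (32 - hwunit);
}
```

Feeding the quirked unit into the scale step is the same as counter << ESU,
i.e. the result is in microjoules.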

So now you are concerned about the rapl_timer interval which is calculated
so that the counter does not overflow for a total dissipation of 200W,
which is equivalent to 200J/s. The maximum counter width is 32 bit.

So depending on ESU the code scales the timeout to:

t[ESU] = (1 << (31 - ESU)) / 200

So for the normal case we get:

t[0] = 10.737e6 s
...
t[30] = 0.010 s
t[31] = 0.005 s

The counter capacity for ESU=31 is

cap = (1 << 32) * 4.65661e-10J = 2J

So:

toverfl = 2J / 200W = 0.01s

which we cut in half to avoid running the timer and the counter in lockstep
which can cause overflows to go undetected. So this looks correct.


But for your Baytrail/Braswell that results in:

t[ESU] = (1 << (31 - (32 - ESU))) / 200

t[0] = TOTAL CRAP because the shift value becomes -1

But what saves you here is the check for

if (hwunit < 32)

which catches the hwunit = 32 - ESU[0] case and sets the timer to 2ms. So
for the remaining ones we have:

t[1] = 0.005s
...
t[31] = 5.3687e+06s
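
The numbers above can be reproduced with a sketch like the following. It
follows the formula and the hwunit check as described in this mail, not the
actual source, and the function name is an assumption:

```c
#include <stdint.h>

/*
 * Timer interval in milliseconds: (1 << (31 - hwunit)) / 200 seconds,
 * i.e. 5ms * 2^(31 - hwunit). Falls back to 2ms when hwunit >= 32.
 */
static inline uint64_t rapl_timer_ms(unsigned int hwunit)
{
	uint64_t ms = 2;

	if (hwunit < 32)
		ms = (1000 / 200) * (1ULL << (31 - hwunit));
	return ms;
}
```

For the standard case hwunit is ESU itself, for the quirk it is 32 - ESU, so
rapl_timer_ms(32 - 1) gives the 5ms of t[1] above.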

So let's look at the counter capacity for ESU=1:

cap = (1 << 32) * 2 uJ == 8589.93J

The resulting overflow is:

toverfl = 8589.93J / 200W = 42.9497 s

So if we divide this by two we end up with: 21.4748 s

So your timeout is actually off by a factor of ~4k, which is not surprising
due to the fact that the capacity has a ratio of 1 : 2147.48 and you have an
additional off by one due to the (32 - ESU) quirk.....

So the overflow prevention timer fires ~4k times more often than needed, for
no good reason. Indeed a very power friendly design.

The timer calculation magically works for the original standard conversion,
but in this case it is utter crap. You really want to have a proper scale
factor for the timer calculation so we end up with:

toverfl = capacity / 200

i.e. you need a way to calculate capacity from the hw_units[] mess and some
factor which is dependent on the base unit. That all can be done with plain
integer math.
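
Something along these lines, as an untested sketch and not a patch. All
names are made up, the capacity is expressed in microjoules for both
encodings, and the 32-bit counter width and the 200W reference are taken
from the discussion above:

```c
#include <stdint.h>

#define REF_POWER_W	200	/* reference dissipation used above */

/* Capacity of a 32-bit counter in microjoules. Assumes 0 <= esu <= 31. */
static inline uint64_t rapl_cap_uj(unsigned int esu, int byt_quirk)
{
	if (byt_quirk)
		return (1ULL << 32) << esu;	/* one count = 2^ESU uJ */
	return (1000000ULL << 32) >> esu;	/* one count = 1/2^ESU J */
}

/* Timer interval in ms: half of (capacity / 200W) to avoid lockstep. */
static inline uint64_t rapl_timer_ms_from_cap(uint64_t cap_uj)
{
	/* uJ / W = us, / 1000 = ms, / 2 = half the overflow time */
	return cap_uj / (REF_POWER_W * 1000ULL * 2);
}
```

That yields 5ms for the standard ESU=31 case and about 21.47s for
Baytrail/Braswell with ESU=1, which matches the numbers worked out above.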

> in short, it leads every 80ms system triggers an event to read counters,

I have no idea where these 80ms come from and I can't make any sense of
the rest of your response either.

Fact is, that you did not do the math and just tinkered the
Baytrail/Braswell support into the existing code and declared it done when
it did not blow up in your face.

Really excellent engineering work - NOT!

Thanks,

tglx