Re: [PATCH V4 3/3] thermal: cpufreq_cooling: Reuse sched_cpu_util() for SMP platforms

From: Dietmar Eggemann
Date: Thu Dec 03 2020 - 06:54:53 EST


On 24/11/2020 07:26, Viresh Kumar wrote:
> Several parts of the kernel are already using the effective CPU
> utilization (as seen by the scheduler) to get the current load on the
> CPU, do the same here instead of depending on the idle time of the CPU,
> which isn't that accurate comparatively.
>
> This is also the right thing to do as it makes the cpufreq governor
> (schedutil) align better with the cpufreq_cooling driver, as the power
> requested by cpufreq_cooling governor will exactly match the next
> frequency requested by the schedutil governor since they are both using
> the same metric to calculate load.
>
> This was tested on ARM Hikey6220 platform with hackbench, sysbench and
> schbench. None of them showed any regression or significant
> improvements. Schbench is the most important one out of these as it
> creates the scenario where the utilization numbers provide a better
> estimate of the future.
>
> Scenario 1: The CPUs were mostly idle in the previous polling window of
> the IPA governor as the tasks were sleeping and here are the details
> from traces (load is in %):
>
> Old: thermal_power_cpu_get_power: cpus=00000000,000000ff freq=1200000 total_load=203 load={{0x35,0x1,0x0,0x31,0x0,0x0,0x64,0x0}} dynamic_power=1339
> New: thermal_power_cpu_get_power: cpus=00000000,000000ff freq=1200000 total_load=600 load={{0x60,0x46,0x45,0x45,0x48,0x3b,0x61,0x44}} dynamic_power=3960
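
For anyone reading these trace lines: the load={{...}} field appears to hold the per-CPU load in percent, printed in hex, and total_load matches their sum. Below is a minimal Python sketch that decodes the two scenario 1 lines and checks that. The helper name decode_loads is hypothetical and not part of any kernel tooling.

```python
import re

def decode_loads(trace_line):
    """Return (total_load, per_cpu_loads) parsed from one trace line."""
    total = int(re.search(r"total_load=(\d+)", trace_line).group(1))
    # The per-CPU values sit between the double braces, comma separated, in hex.
    body = re.search(r"load=\{\{([^}]*)\}\}", trace_line).group(1)
    loads = [int(value, 16) for value in body.split(",")]
    return total, loads

# Scenario 1 trace lines, copied from the quoted text above.
old = ("cpus=00000000,000000ff freq=1200000 total_load=203 "
       "load={{0x35,0x1,0x0,0x31,0x0,0x0,0x64,0x0}} dynamic_power=1339")
new = ("cpus=00000000,000000ff freq=1200000 total_load=600 "
       "load={{0x60,0x46,0x45,0x45,0x48,0x3b,0x61,0x44}} dynamic_power=3960")

for name, line in (("Old", old), ("New", new)):
    total, loads = decode_loads(line)
    print(name, loads, "sum =", sum(loads), "total_load =", total)
```

Decoded, the idle-time numbers show four CPUs at 0% and one at 100%, while the util-based numbers sit between 59% and 97% on every CPU.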

When I run schbench (-t 16 -r 5) on hikey960, I get multiple (~50)
instances of a ~80ms task activity phase followed by a ~20ms idle phase
on all CPUs.

So I assume that scenario 1 is at the beginning (but you mentioned the
tasks were sleeping?) and scenario 2 is somewhere in the middle of the
test run?
IMHO, the util-based approach delivers much better results at the
beginning and at the end of the entire test run.
During the test run, the util-based and the idle-based approaches deliver
similar results.

It's a little bit tricky to compare test results since the IPA sampling
rate is 100ms and the load values you get depend on how the workload
pattern and the IPA sampling align.

> Here, the "Old" line gives the load and requested_power (dynamic_power
> here) numbers calculated using the idle time based implementation, while
> "New" is based on the CPU utilization from scheduler.
>
> As can be clearly seen, the load and requested_power numbers are simply
> incorrect in the idle time based approach and the numbers collected from
> CPU's utilization are much closer to the reality.

I assume the IPA sampling is done after ~50ms of the first task activity
phase.

> Scenario 2: The CPUs were busy in the previous polling window of the IPA
> governor:
>
> Old: thermal_power_cpu_get_power: cpus=00000000,000000ff freq=1200000 total_load=800 load={{0x64,0x64,0x64,0x64,0x64,0x64,0x64,0x64}} dynamic_power=5280
> New: thermal_power_cpu_get_power: cpus=00000000,000000ff freq=1200000 total_load=708 load={{0x4d,0x5c,0x5c,0x5b,0x5c,0x5c,0x51,0x5b}} dynamic_power=4672
>
> As can be seen, the idle time based load is 100% for all the CPUs as it
> took only the last window into account, but in reality the CPUs aren't
> that loaded as shown by the utilization numbers.
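
In all four trace lines of this thread, dynamic_power is consistent with a fixed per-CPU value at freq=1200000 scaled by total_load / 100. The Python sketch below checks this with integer arithmetic. The value 660 is inferred from the numbers quoted here (5280 / 8 CPUs at 100% load); it is an assumption, not something read from the driver or a device tree, and the function name scaled_power is hypothetical.

```python
# Assumed per-CPU power at 100% load, inferred from the "Old" scenario 2 line.
FULL_LOAD_POWER = 5280 // 8  # 660

def scaled_power(total_load, full_load_power=FULL_LOAD_POWER):
    """Scale the assumed full-load power by total_load (in %), rounding down."""
    return full_load_power * total_load // 100

# (total_load, dynamic_power) pairs copied from the four trace lines.
samples = {
    "scenario 1 old": (203, 1339),
    "scenario 1 new": (600, 3960),
    "scenario 2 old": (800, 5280),
    "scenario 2 new": (708, 4672),
}

for name, (total_load, traced_power) in samples.items():
    print(name, scaled_power(total_load), traced_power)
```

So the difference in requested power between the two approaches comes entirely from the load estimate, since the frequency is the same in every line.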

Is this an IPA sampling at the end of the ~20ms idle phase?

[...]