Re: [PATCH v2 1/3] sched/uclamp: Set max_spare_cap_cpu even if max_spare_cap is 0

From: Dietmar Eggemann
Date: Thu Feb 09 2023 - 13:02:26 EST


On 07/02/2023 10:45, Vincent Guittot wrote:
> On Sun, 5 Feb 2023 at 23:43, Qais Yousef <qyousef@xxxxxxxxxxx> wrote:
>>
>> When uclamp_max is being used, the util of the task could be higher
>> than the spare capacity of the CPU, but due to the uclamp_max value we
>> force-fit it there.
>>
>> The way the condition for updating max_spare_cap in
>> find_energy_efficient_cpu() was constructed, it ignored any CPU whose
>> spare_cap is less than or _equal_ to the current max_spare_cap. Since
>> we initialize max_spare_cap to 0, this led to never setting
>> max_spare_cap_cpu, and hence we never performed compute_energy() for
>> this cluster, missing an opportunity for a more energy-efficient
>> placement that honours the uclamp_max setting.
>>
>> max_spare_cap = 0;
>> cpu_cap = capacity_of(cpu) - task_util(p); // 0 if task_util(p) is high
>>
>> ...
>>
>> util_fits_cpu(...); // will return true if uclamp_max forces it to fit

s/true/1/ ?

>>
>> ...
>>
>> // this logic will fail to update max_spare_cap_cpu if cpu_cap is 0
>> if (cpu_cap > max_spare_cap) {
>>         max_spare_cap = cpu_cap;
>>         max_spare_cap_cpu = cpu;
>> }
>>
>> prev_spare_cap suffers from a similar problem.
>>
>> Fix the logic by converting the variables to long and treating a
>> value of -1 as 'not populated' instead of 0, which is a viable and
>> correct spare capacity value.
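
(Aside: a standalone sketch of the change described above, not the actual
patch. Plain userspace C with made-up spare capacities, just to show why a
0-initialized max_spare_cap together with a strict '>' comparison can never
pick a CPU whose spare capacity is exactly 0, while a -1 sentinel can.)

/*
 * Standalone demo: all CPUs report 0 spare capacity, e.g. because the
 * task is only force-fit onto them via uclamp_max.
 */
#include <stdio.h>

int main(void)
{
        long spare_cap[] = { 0, 0, 0 };
        int nr_cpus = 3;

        /* Old scheme: init to 0, so a spare capacity of 0 never wins. */
        long max_spare_cap_old = 0;
        int max_spare_cap_cpu_old = -1;

        /* Described fix: -1 means 'not populated', 0 is a valid value. */
        long max_spare_cap_new = -1;
        int max_spare_cap_cpu_new = -1;

        for (int cpu = 0; cpu < nr_cpus; cpu++) {
                if (spare_cap[cpu] > max_spare_cap_old) {
                        max_spare_cap_old = spare_cap[cpu];
                        max_spare_cap_cpu_old = cpu;
                }
                if (spare_cap[cpu] > max_spare_cap_new) {
                        max_spare_cap_new = spare_cap[cpu];
                        max_spare_cap_cpu_new = cpu;
                }
        }

        printf("old: max_spare_cap_cpu = %d (never set)\n",
               max_spare_cap_cpu_old);
        printf("new: max_spare_cap_cpu = %d\n", max_spare_cap_cpu_new);

        return 0;
}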

The issue I see here is that in case we don't have any spare capacity
left, the energy calculation (in terms of CPU utilization) isn't correct
anymore.

Since `effective_cpu_util()` returns at most `max = arch_scale_cpu_capacity()`,
you never know how big the `busy_time` of the PD really is at this moment.

eenv_pd_busy_time()

    for_each_cpu(cpu, pd_cpus)
        busy_time += effective_cpu_util(..., ENERGY_UTIL, ...)
        ^^^^^^^^^

with:

    sum_util = min(busy_time + task_busy_time, pd_cap)
                   ^^^^^^^^^

    freq = (1.25 * max_util / max) * max_freq

    energy = (perf_state(freq)->cost / max) * sum_util


energy is not related to CPU utilization anymore (since there is no idle
time/spare capacity left).

So EAS keeps packing on the cheaper PD/clamped OPP.
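
(Toy numbers only, not from an actual EM: plugging the sum_util/energy
formulas above into a small userspace C snippet shows that once the PD
reports no idle time, the estimate comes out the same no matter how much
extra work is placed there, so the packing can't be penalized.)

#include <stdio.h>

#define SCALE 1024UL    /* stand-in for arch_scale_cpu_capacity() */

/* energy ~ cost * min(busy_time + task_busy_time, pd_cap) / SCALE */
static unsigned long pd_energy(unsigned long busy_time,
                               unsigned long task_busy_time,
                               unsigned long pd_cap,
                               unsigned long cost)
{
        unsigned long sum_util = busy_time + task_busy_time;

        if (sum_util > pd_cap)
                sum_util = pd_cap;

        return cost * sum_util / SCALE;
}

int main(void)
{
        unsigned long pd_cap = 4 * 446; /* e.g. 4 little CPUs of capacity 446 */
        unsigned long cost = 100;       /* made-up perf_state cost */

        /* busy_time is already capped per CPU, so a saturated PD reports pd_cap ... */
        unsigned long busy_time = pd_cap;

        /* ... and a small or a big extra task yields the same energy estimate. */
        printf("energy with a  50-util task: %lu\n",
               pd_energy(busy_time, 50, pd_cap, cost));
        printf("energy with a 400-util task: %lu\n",
               pd_energy(busy_time, 400, pd_cap, cost));

        return 0;
}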

E.g. Juno-r0 (CPU capacities [446 1024 1024 446 446 446]) with 6 8ms/16ms
uclamp_max=440 tasks all running on the little PD, cpumask=0x39. The big
PD is idle but never beats prev_cpu in terms of energy consumption for
the wakee.

[...]