Re: [PATCH v2 1/3] sched/uclamp: Set max_spare_cap_cpu even if max_spare_cap is 0

From: Qais Yousef
Date: Sat Feb 11 2023 - 12:51:00 EST


On 02/09/23 19:02, Dietmar Eggemann wrote:
> On 07/02/2023 10:45, Vincent Guittot wrote:
> > On Sun, 5 Feb 2023 at 23:43, Qais Yousef <qyousef@xxxxxxxxxxx> wrote:
> >>
> >> When uclamp_max is being used, the util of the task could be higher than
> >> the spare capacity of the CPU, but due to uclamp_max value we force fit
> >> it there.
> >>
> >> The way the condition for checking for max_spare_cap in
> >> find_energy_efficient_cpu() was constructed; it ignored any CPU that has
> >> its spare_cap less than or _equal_ to max_spare_cap. Since we initialize
> >> max_spare_cap to 0; this lead to never setting max_spare_cap_cpu and
> >> hence ending up never performing compute_energy() for this cluster and
> >> missing an opportunity for a better energy efficient placement to honour
> >> uclamp_max setting.
> >>
> >> max_spare_cap = 0;
> >> cpu_cap = capacity_of(cpu) - task_util(p); // 0 if task_util(p) is high
> >>
> >> ...
> >>
> >> util_fits_cpu(...); // will return true if uclamp_max forces it to fit
>
> s/true/1/ ?
>
> >>
> >> ...
> >>
> >> // this logic will fail to update max_spare_cap_cpu if cpu_cap is 0
> >> if (cpu_cap > max_spare_cap) {
> >> max_spare_cap = cpu_cap;
> >> max_spare_cap_cpu = cpu;
> >> }
> >>
> >> prev_spare_cap suffers from a similar problem.
> >>
> >> Fix the logic by converting the variables into long and treating -1
> >> value as 'not populated' instead of 0 which is a viable and correct
> >> spare capacity value.
>
> The issue I see here is in case we don't have any spare capacity left,
> the energy calculation (in terms CPU utilization) isn't correct anymore.
>
> Due to `effective_cpu_util()` returning `max=arch_scale_cpu_capacity()`
> you never know how big the `busy_time` for the PD really is in this moment.
>
> eenv_pd_busy_time()
>
> for_each_cpu(cpu, pd_cpus)
> busy_time += effective_cpu_util(..., ENERGY_UTIL, ...)
> ^^^^^^^^^
>
> with:
>
> sum_util = min(busy_time + task_busy_time, pd_cap)
> ^^^^^^^^^
>
> freq = (1.25 * max_util / max) * max_freq
>
> energy = (perf_state(freq)->cost / max) * sum_util
>
>
> energy is not related to CPU utilization anymore (since there is no idle
> time/spare capacity) left.

Am I right that what you're saying is that the energy calculation for the PD
will be capped to a certain value and this is why you think the energy is
incorrect?

What should be the correct energy (in theory at least)?

>
> So EAS keeps packing on the cheaper PD/clamped OPP.

Which is the desired behavior for uclamp_max?

The only issue I see is that we want to distribute within a pd. Which is
something I was going to work on and send after later - but can lump it in this
series if it helps.

>
> E.g. Juno-r0 [446 1024 1024 446 446 446] with 6 8ms/16ms uclamp_max=440
> tasks all running on little PD, cpumask=0x39. The big PD is idle but
> never beats prev_cpu in terms of energy consumption for the wakee.

IIUC I'm not seeing this being a problem. The goal of capping with uclamp_max
is two folds:

1. Prevent tasks from consuming energy.
2. Keep them away from expensive CPUs.

2 is actually very important for 2 reasons:

a. Because of max aggregation - any uncapped tasks that wakes up will
cause a frequency spike on this 'expensive' cpu. We don't have
a mechanism to downmigrate it - which is another thing I'm working
on.
b. It is desired to keep these bigger cpu idle ready for more important
work.

For 2, generally we don't want these tasks to steal bandwidth from these CPUs
that we'd like to preserve for other type of work.

Of course userspace has control by selecting the right uclamp_max value. They
can increase it to allow a spill to next pd - or keep it low to steer them more
strongly on a specific pd.


Cheers

--
Qais Yousef