Re: [PATCH v3] sched/fair: unlink misfit task from cpu overutilized

From: Vincent Guittot
Date: Mon Jan 16 2023 - 06:23:47 EST


On Mon, 16 Jan 2023 at 10:07, Dietmar Eggemann <dietmar.eggemann@xxxxxxx> wrote:
>
> On 13/01/2023 14:40, Vincent Guittot wrote:
> > By taking into account uclamp_min, the 1:1 relation between task misfit
> > and cpu overutilized is no more true as a task with a small util_avg of
>
> of ?
>
> > may not fit a high capacity cpu because of uclamp_min constraint.
> >
> > Add a new state in util_fits_cpu() to reflect the case that task would fit
> > a CPU except for the uclamp_min hint which is a performance requirement.
> >
> > Use -1 to reflect that a CPU doesn't fit only because of uclamp_min so we
> > can use this new value to take additional action to select the best CPU
> > that doesn't match uclamp_min hint.
> >
> > Signed-off-by: Vincent Guittot <vincent.guittot@xxxxxxxxxx>
> > ---
> >
> > Change since v2:
> > - fix a condition in feec()
> > - add comments
> >
> > kernel/sched/fair.c | 108 ++++++++++++++++++++++++++++++++++----------
> > 1 file changed, 83 insertions(+), 25 deletions(-)
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index e9d906a9bba9..29adb9e27b3d 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -4525,8 +4525,7 @@ static inline int util_fits_cpu(unsigned long util,
> > * 2. The system is being saturated when we're operating near
> > * max capacity, it doesn't make sense to block overutilized.
> > */
> > - uclamp_max_fits = (capacity_orig == SCHED_CAPACITY_SCALE) && (uclamp_max == SCHED_CAPACITY_SCALE);
> > - uclamp_max_fits = !uclamp_max_fits && (uclamp_max <= capacity_orig);
> > + uclamp_max_fits = (uclamp_max <= capacity_orig) || (capacity_orig == SCHED_CAPACITY_SCALE);
>
> Isn't `uclamp_max <= capacity_orig` always true for `capacity_orig ==
> SCHED_CAPACITY_SCALE`?
>
> uclamp_max = [0..SCHED_CAPACITY_SCALE] , capacity_orig =
> SCHED_CAPACITY_SCALE
>
> > fits = fits || uclamp_max_fits;
>
> Like Qais I don't understand this change. I read the previous discussion
> from https://lkml.kernel.org/r/20221208140526.vvmjxlz6akgqyoma@airbuntu
> ("Re: [RFC PATCH 3/3] sched/fair: Traverse cpufreq policies to detect
> capacity inversion").
>
> I assume Qais wanted to force `uclamp_max_fits = 0` for a big CPU
> (`capacity_orig = 1024`) and a `uclamp_max = 1024` to not lift `fits`
> from 0 to 1.
>
> > /*
> > @@ -4561,8 +4560,8 @@ static inline int util_fits_cpu(unsigned long util,
> > * handle the case uclamp_min > uclamp_max.
> > */
> > uclamp_min = min(uclamp_min, uclamp_max);
> > - if (util < uclamp_min && capacity_orig != SCHED_CAPACITY_SCALE)
> > - fits = fits && (uclamp_min <= capacity_orig_thermal);
> > + if (fits && (util < uclamp_min) && (uclamp_min > capacity_orig_thermal))
> > + return -1;
> > return fits;
> > }
> > @@ -4572,7 +4571,11 @@ static inline int task_fits_cpu(struct task_struct *p, int cpu)
> > unsigned long uclamp_min = uclamp_eff_value(p, UCLAMP_MIN);
> > unsigned long uclamp_max = uclamp_eff_value(p, UCLAMP_MAX);
> > unsigned long util = task_util_est(p);
> > - return util_fits_cpu(util, uclamp_min, uclamp_max, cpu);
> > + /*
> > + * Return true only if the cpu fully fits the task requirements, which
> > + * include the utilization but also the performance.
>
> Not sure if people get what `performance requirements` mean here? I know
> we want to use `performance` rather `bandwidth hint` to describe what
> uclamp is. So shouldn't we use `utilization but also uclamp`?
>
> > + */
> > + return (util_fits_cpu(util, uclamp_min, uclamp_max, cpu) > 0);
> > }
> >
> > static inline void update_misfit_status(struct task_struct *p, struct rq *rq)
> > @@ -6132,6 +6135,7 @@ static inline bool cpu_overutilized(int cpu)
> > unsigned long rq_util_min = uclamp_rq_get(cpu_rq(cpu), UCLAMP_MIN);
> > unsigned long rq_util_max = uclamp_rq_get(cpu_rq(cpu), UCLAMP_MAX);
> >
> > + /* Return true only if the utlization doesn't fit its capacity */
>
> s/utlization/utilization
> s/its/CPU ?
>
> > return !util_fits_cpu(cpu_util_cfs(cpu), rq_util_min, rq_util_max, cpu);
> > }
>
> cpu_overutilized() is the only place where we now only test for
> !util_fits_cpu(). The new comment says we only care about utilization
> not fitting CPU capacity.
>
> Does this mean the rq uclamp values are not important here and we could
> go back to use fits_capacity()?
>
> Not sure since util_fits_cpu() is still coded differently:

uclamp_min is not important but uclamp_max still cap the utilization

>
> fits = fits_capacity(util, capacity)
>
> fits = fits || uclamp_max_fits <-- uclamp_max_fits can turn fits from
> 0 into 1, so util doesn't fit but
> we don't return 1?
>
> That said, I don't understand the current 'uclamp_max_fits = (uclamp_max
> <= capacity_orig) || (capacity_orig == SCHED_CAPACITY_SCALE)' in
> util_fits_cpu(), like already mentioned.
>
> > @@ -6925,6 +6929,7 @@ static int
> > select_idle_capacity(struct task_struct *p, struct sched_domain *sd, int target)
> > {
> > unsigned long task_util, util_min, util_max, best_cap = 0;
> > + int fits, best_fits = 0;
> > int cpu, best_cpu = -1;
> > struct cpumask *cpus;
> >
> > @@ -6940,12 +6945,28 @@ select_idle_capacity(struct task_struct *p, struct sched_domain *sd, int target)
> >
> > if (!available_idle_cpu(cpu) && !sched_idle_cpu(cpu))
> > continue;
> > - if (util_fits_cpu(task_util, util_min, util_max, cpu))
> > +
> > + fits = util_fits_cpu(task_util, util_min, util_max, cpu);
> > +
> > + /* This CPU fits with all capacity and performance requirements */
>
> In task_fits_cpu() `utilization and performance (better uclamp)
> requirements` term was used. I assume it's the same thing here?
>
> > + if (fits > 0)
> > return cpu;
> > + /*
> > + * Only the min performance (i.e. uclamp_min) doesn't fit. Look
> > + * for the CPU with highest performance capacity.
> ^^^^^^^^^^^^^^^^^^^^
>
> Do we use a new CPU capacity value `performance capacity (1)` here?
>
> Which I guess is `capacity_orig_of(cpu) - thermal_load_avg(cpu_rq(cpu)`.
>
> I'm asking since util_fits_cpu() still uses: `capacity_orig_thermal (2)
> = capacity_orig - arch_scale_thermal_pressure()` when checking whether
> to return -1. Shouldn't (1) and (2) be the same?

I'm all in favor of both being capacity_orig_of(cpu) -
thermal_load_avg(cpu_rq(cpu) like the capacity inversion detection

>
> > + */
> > + else if (fits < 0)
> > + cpu_cap = capacity_orig_of(cpu) - thermal_load_avg(cpu_rq(cpu));
> >
> > - if (cpu_cap > best_cap) {
> > + /*
> > + * First, select cpu which fits better (-1 being better than 0).
> > + * Then, select the one with largest capacity at same level.
> > + */
> > + if ((fits < best_fits) ||
> > + ((fits == best_fits) && (cpu_cap > best_cap))) {
> > best_cap = cpu_cap;
> > best_cpu = cpu;
> > + best_fits = fits;
> > }
> > }
> >
>
> [...]
>
> > @@ -7446,26 +7480,50 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
> > if (prev_delta < base_energy)
> > goto unlock;
> > prev_delta -= base_energy;
> > + prev_thermal_cap = cpu_thermal_cap;
> > best_delta = min(best_delta, prev_delta);
> > }
> >
> > /* Evaluate the energy impact of using max_spare_cap_cpu. */
> > if (max_spare_cap_cpu >= 0 && max_spare_cap > prev_spare_cap) {
> > + /* Current best energy cpu fits better */
> > + if (max_fits < best_fits)
> > + continue;
> > +
> > + /*
> > + * Both don't fit performance (i.e. uclamp_min) but
> > + * best energy cpu has better performance.
>
> IMHO, `performance` stands for `cpu_thermal_cap` which is
> `cpu_capacity_orig - thermal pressure`.
>
> I assume `performance` is equal to `performance capacity` used in
> select_idle_capacity which uses thermal_load_avg() and not thermal
> pressure. Why this difference?
>
> [...]