Re: [PATCH] sched/fair: Fix task utilization accountability in cpu_util_next()

From: Vincent Donnefort
Date: Mon Feb 22 2021 - 10:02:47 EST


On Mon, Feb 22, 2021 at 12:23:04PM +0000, Quentin Perret wrote:
> On Monday 22 Feb 2021 at 11:36:03 (+0000), Vincent Donnefort wrote:
> > Here's with real life numbers.
> >
> > The task: util_avg=3 (1) util_est=11 (2)
> >
> > pd0 (CPU-0, CPU-1, CPU-2)
> >
> > cpu_util_next(CPU-0, NULL): 7
> > cpu_util_next(CPU-1, NULL): 3
> > cpu_util_next(CPU-2, NULL): 0 <- Most capacity, try to place task here.
> >
> > cpu_util_next(CPU-2, task): 0 + 11 (2)
> >
> >
> > pd1 (CPU-3):
> >
> > cpu_util_next(CPU-3, NULL): 77
> >
> > cpu_util_next(CPU-3, task): 77 + 3 (1)
> >
> >
> > On pd0, the task contribution is 11. On pd1, it is 3.
>
> Yes but that accurately reflects what the task's impact on frequency
> selection of those CPUs if it was enqueued there, right?
>
> This is an important property we should aim to keep, the frequency
> prediction needs to be in sync with the actual frequency request, or
> the energy estimate will be off.

You mean that it could lead to a wrong frequency estimation when doing
freq = map_util_freq() in em_cpu_energy()?

But in any case, the computed energy, being the product of sum_util and the
OPP's cost, is directly affected by this util_avg/util_est difference.

In the case where the task placement doesn't change the OPP, which is often the
case, we can simplify the comparison and end up with the following:

delta_energy(CPU-3): OPP3 cost * (cpu_util_avg + task_util_avg - cpu_util_avg)
delta_energy(CPU-2): OPP2 cost * (cpu_util_est + task_util_est - cpu_util_est)

=> OPP3 cost * task_util_avg < OPP2 cost * task_util_est

With the same example I described previously, if you apply the scaled OPP costs
of 0.76 for CPU-3 and 0.65 for CPU-2 (real life OPP scaled costs), we have:

0.76 * 3 = 2.3 (CPU-3) < 0.65 * 11 = 7.15 (CPU-2)

The task is placed on CPU-3, while it would have been much more efficient to use
CPU-2.
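
As a rough userspace sketch of the arithmetic above (this is not kernel code:
the struct, the function and the three sample placements are hypothetical names
made up for illustration, and the OPP costs are expressed in hundredths so that
everything stays in integer math):

```c
/* Illustrative model only; none of these names exist in the kernel. */
struct placement {
	unsigned long cpu_util;   /* CPU utilization without the task */
	unsigned long task_util;  /* task contribution as seen on that CPU */
	unsigned long opp_cost;   /* scaled OPP cost in hundredths (0.76 -> 76) */
};

/*
 * Energy delta when the placement does not change the OPP:
 * cost * (util with the task - util without the task).
 */
static unsigned long delta_energy(const struct placement *p)
{
	unsigned long before = p->cpu_util;
	unsigned long after = p->cpu_util + p->task_util;

	return p->opp_cost * (after - before);
}

/* CPU-3: the task is accounted with util_avg = 3. */
static const struct placement cpu3_avg = { 77, 3, 76 };
/* CPU-2: the task is accounted with util_est = 11. */
static const struct placement cpu2_est = { 0, 11, 65 };
/* CPU-2 again, assuming the task were accounted with util_avg = 3. */
static const struct placement cpu2_avg = { 0, 3, 65 };
```

With these inputs, delta_energy() gives 228 for cpu3_avg and 715 for cpu2_est
(2.28 and 7.15 once divided by 100). The third placement is only there to show
what the comparison would look like if both CPUs saw the same task
contribution: 195 (1.95), which is below CPU-3's value.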

>
> > When computing the energy
> > deltas, pd0's is likely to be higher than pd1's, only because the task
> > contribution is higher for one comparison than the other.
>
> You mean the contribution to sum_util right? I think I see what you mean
> but I'm still not sure if this really is an issue. This is how util_est
> works, and the EM stuff is just consistent with that.
>
> The issue you describe can only happen (I think) when a rq's util_avg is
> larger than its util-est ewma by some margin (that has to do with the
> ewma-util_avg delta for the task?). But that means the ewma is not to be
> trusted to begin with, so ...

cfs_rq->avg.util_est.ewma is not used. cpu_util() will only return the max of
ue.enqueued and util_avg.
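
To make that concrete, here is a simplified userspace model of the selection
(an assumption-laden sketch, not the in-kernel cpu_util(): the struct and
function names are hypothetical, and details such as clamping to the CPU
capacity are left out):

```c
/* Hypothetical stand-in for the rq-level signals discussed above. */
struct rq_signals {
	unsigned long util_avg;
	unsigned long util_est_enqueued;
	unsigned long util_est_ewma;	/* carried along, but never read below */
};

/* Return the larger of util_avg and util_est.enqueued; ewma plays no role. */
static unsigned long model_cpu_util(const struct rq_signals *s)
{
	if (s->util_avg > s->util_est_enqueued)
		return s->util_avg;

	return s->util_est_enqueued;
}
```

Whatever value is stored in util_est_ewma, the result only depends on the two
other fields.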