Re: [PATCH v2 2/7] sched/fair: Decay task PELT values during migration

From: Vincent Guittot
Date: Fri Jan 21 2022 - 10:27:47 EST


On Thu, 20 Jan 2022 at 22:12, Vincent Donnefort
<vincent.donnefort@xxxxxxx> wrote:
>
> [...]
>
> > > > And if cpu is not idle, you can't apply the diff between clk_pelt and clock_task
> > > >
> > > > >
> > > > > > - (B) doesn't seem to be accurate as you skip irq and steal time
> > > > > > accounting and you don't apply any scale invariance if the cpu is not
> > > > > > idle
> > > > >
> > > > > The missing irq and paravirt time is the reason why it is called "estimator".
> > > > > But maybe there's a chance of improving this part with a lockless version of
> > > > > rq->prev_irq_time and rq->prev_steal_time_rq?
> > > > >
> > > > > > - IIUC your explanation in the commit message above, the (A) period
> > > > > > seems to be a problem only when idle but you apply it unconditionally.
> > > > >
> > > > > If the CPU is idle (and clock_pelt == clock_task), only the B part would be
> > > > > worth something:
> > > > >
> > > > > A + B = [clock_task - clock_pelt] + [sched_clock_cpu() - clock]
> > > > >                     A                            B
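Purely for illustration (the helper name and its placement are invented here, not taken from the patch under discussion), the two terms could be read back from the task's previous rq roughly like this:

static u64 estimated_pelt_lag(struct rq *rq)
{
        u64 lag;

        /* A: pending delta between the task clock and the PELT clock */
        lag  = rq->clock_task - rq->clock_pelt;

        /* B: wallclock time elapsed since rq->clock was last updated */
        lag += sched_clock_cpu(cpu_of(rq)) - rq->clock;

        return lag;
}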
> > > > >
> > > > > > If cpu is idle you can assume that clock_pelt should be equal to
> > > > > > clock_task, but you can't if cpu is not idle; otherwise your sync will
> > > > > > be inaccurate and defeat the primary goal of this patch. If your
> > > > > > problem with clock_pelt is that the pending idle time is not accounted
> > > > > > for when entering idle, but only at the next update (update of blocked
> > > > > > load or wakeup of a thread), then the patch below should fix this and
> > > > > > remove your A.
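The snippet mentioned above is not reproduced in this reply; as a purely hypothetical sketch of the idea, with an invented helper name, syncing the PELT clock on idle entry could look roughly like this:

/*
 * Hypothetical sketch: fold the pending idle time into clock_pelt as soon
 * as the rq goes idle, instead of waiting for the next blocked-load update
 * or wakeup. While idle there is no scale invariance to apply, so
 * clock_pelt can simply follow clock_task.
 */
static inline void sync_idle_rq_clock_pelt(struct rq *rq)
{
        rq->clock_pelt = rq_clock_task(rq);
}

The idea is only about when the sync happens (on idle entry rather than at the next update), not about how it is implemented.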
> > > > >
> > > > > That would slightly help the current situation, but this part is already
> > > > > covered by the estimator.
> > > >
> > > > But the estimator, as you name it, is wrong because the A part can't be
> > > > applied unconditionally.
> > >
> > > Hum, it is used only in the !active migration. So we know the task was sleeping
> > > before that migration. As a consequence, the time we need to account is "sleeping"
> > > time from the task point of view, which is clock_pelt == clock_task (for
> > > __update_load_avg_blocked_se()). Otherwise, we would only decay with the
> > > "wallclock" idle time instead of the "scaled" one wouldn't we?
> >
> > clock_pelt == clock_task only when the CPU is idle and after updating
> > lost_idle_time, but you have no idea of the state of the CPU when
> > migrating the task.
>
> I was just applying the time scaling at the task level. Why should it depend
> on the CPU state?
>
> The situation would be as follows:
>
>                  <--X--> <--Y-->
>         +-------+-------+-------+
> CPUX ___|   B   |   A   |   B   |___
>                                 ^
>                         migrate A
>
> In such a scenario, CPUX's PELT clock is indeed scaled. Task A's running
> time (X) has already been accounted, so what's left is to get an accurate
> contribution for the idle time (Y). We would usually rely on the CPU being
> idle for the catch-up, and that time would be Y + (X - scaled(X)). Without
> the catch-up, we would only account scaled(Y) for the sleeping time Y at the
> migration. Applied to the same kind of graph as update_rq_clock_pelt()'s:
>
> clock_task  | 1| 2| 3| 4| 5| 6| 7| 8|
> clock_pelt  | 1   | 2   | 3   | 4   |   (CPU's running, clock_pelt is scaled)
> expected    | 1   | 2   | 5| 6| 7| 8|
>              <---- X ---><--- Y ---->
> Task A       ************------------
>                                     ^
>                             migrate A
>
> The contribution for Task A's idle time at the migration (as we know we
> won't have another chance to catch up with clock_task later) should be 6,
> not 2, regardless of the CPU state.
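Restating the arithmetic of the diagram as a tiny standalone program (plain userspace C, nothing kernel-specific, with the values taken from the example above):

#include <stdio.h>

int main(void)
{
        /*
         * From the diagram: the CPU runs at half capacity the whole time,
         * task A runs for X wallclock ticks, then sleeps for Y ticks
         * before being migrated.
         */
        unsigned int X = 4, Y = 4;
        unsigned int scaled_X = X / 2;  /* clock_pelt progress during X */
        unsigned int scaled_Y = Y / 2;  /* clock_pelt progress during Y */

        /* Decay window with the clock_task catch-up (the "expected" row) */
        printf("with catch-up:    %u\n", Y + (X - scaled_X));

        /* Decay window when syncing with the still-scaled clock_pelt */
        printf("without catch-up: %u\n", scaled_Y);

        return 0;
}

Compiled and run, it prints 6 for the catch-up case and 2 without it, matching the numbers above.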

If task A wakes up on the same CPU, we sync with the scaled clock_pelt,
so why use something different here?
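For reference, a condensed and hypothetical sketch of that same-CPU wakeup case (the wrapper is invented and heavily simplified; the only point is which clock is used as "now"):

/*
 * Simplified, hypothetical wrapper: on a plain wakeup the blocked entity
 * is decayed against the cfs_rq's PELT clock (the scaled clock_pelt),
 * not against clock_task.
 */
static void sync_blocked_se_on_wakeup(struct cfs_rq *cfs_rq, struct sched_entity *se)
{
        u64 now = cfs_rq_clock_pelt(cfs_rq);    /* scaled clock */

        __update_load_avg_blocked_se(now, se);
}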

>
> _But_ indeed, there would be a risk of hitting the lost_idle_time threshold
> and decaying too much... (which is absolutely not handled in the current
> version). So now, if we don't want to bother too much, we could simplify the
> problem and say (which is true with NOHZ_IDLE) that if the CPU is running,
> the clock cannot be that old anyway. So we should only care about the idle
> case, which is mitigated by your proposed snippet and would allow me to get
> rid of the [A] part (clock_task - clock_pelt).
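In other words, and again only as a hypothetical sketch rather than an actual next version, the estimator could then be reduced to the B part and applied only when the previous CPU is idle:

static u64 estimated_idle_pelt_lag(struct rq *rq)
{
        /*
         * If the CPU is running, rq->clock is assumed to be recent enough
         * (NOHZ); if it is idle, the proposed snippet keeps clock_pelt in
         * sync with clock_task, so only the B part is left to estimate.
         */
        if (!idle_cpu(cpu_of(rq)))
                return 0;

        return sched_clock_cpu(cpu_of(rq)) - rq->clock;  /* B */
}

This still ignores irq and steal time, as was pointed out earlier in the thread.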
>
> As for the sched_clock_cpu() usage, I haven't measured anything yet, but
> notice it's already used in the wakeup path, in ttwu_queue_wakelist().
>