Re: [PATCH v6 2/7] sched/fair: Decay task PELT values during wakeup migration

From: Vincent Donnefort
Date: Wed Apr 27 2022 - 06:39:33 EST




On 27/04/2022 10:25, Tao Zhou wrote:
On Tue, Apr 26, 2022 at 10:35:01AM +0100, Vincent Donnefort wrote:

Before being migrated to a new CPU, a task sees its PELT values
synchronized with rq last_update_time. Once done, that same task will also
have its sched_avg last_update_time reset. This means the time between
the migration and the last clock update (B) will not be accounted for in
util_avg and a discontinuity will appear. This issue is amplified by the
PELT clock scaling. If the clock hasn't been updated while the CPU is
idle, clock_pelt will not be aligned with clock_task and that time (A)
will be also lost.

---------|----- A -----|-----------|------- B -----|>
clock_pelt clock_task clock now

This is especially problematic for asymmetric CPU capacity systems which
need stable util_avg signals for task placement and energy estimation.

Ideally, this problem would be solved by updating the runqueue clocks
before the migration. But that would require taking the runqueue lock
which is quite expensive [1]. Instead estimate the missing time and update
the task util_avg with that value:

A + B = clock_task - clock_pelt + sched_clock_cpu() - clock

sched_clock_cpu() is a costly functinon. Limit the usage to the case where
the source CPU is idle as we know this is when the clock is having the
biggest risk of being outdated.

Neither clock_task, clock_pelt nor clock can be accessed without the
runqueue lock. We then need to store those values in a timestamp variable
which can be accessed during the migration. rq's enter_idle will give the
wall-clock time when the rq went idle. We have then:

B = sched_clock_cpu() - rq->enter_idle.

Then, to catch-up the PELT clock scaling (A), two cases:

* !CFS_BANDWIDTH: We can simply use clock_task(). This value is stored
in rq's clock_pelt_idle, before the rq enters idle. The estimated time
is then:

rq->clock_pelt_idle + sched_clock_cpu() - rq->enter_idle.

* CFS_BANDWIDTH: We can't catch-up with clock_task because of the
throttled_clock_task_time offset. cfs_rq's clock_pelt_idle is then
giving the PELT clock when the cfs_rq becomes idle. This gives:

A = rq->clock_pelt_idle - cfs_rq->clock_pelt_idle

The code calulating A below is not consistent with this. The order is reversed.


Good catch, but this comment is actually correct, the code is not. rq->clock_pelt_idle is updated _after_ cfs_rq->clock_pelt_idle. (see
previous email to Vincent)

Thanks,

[...]