Re: [PATCH v7 2/2] sched/fair: update scale invariance of PELT

From: Vincent Guittot
Date: Wed Nov 28 2018 - 04:54:28 EST


Hi,

On Tue, 20 Nov 2018 at 11:55, Vincent Guittot
<vincent.guittot@xxxxxxxxxx> wrote:
>
> The current implementation of load tracking invariance scales the
> contribution with current frequency and uarch performance (only for
> utilization) of the CPU. One main result of this formula is that the
> figures are capped by the current capacity of the CPU. Another one is that
> the load_avg is not invariant because it is not scaled with uarch.
>
> The util_avg of a periodic task that runs r time slots every p time slots
> varies in the range:
>
> U * (1-y^r)/(1-y^p) * y^i < Utilization < U * (1-y^r)/(1-y^p)
>
> where U is the max util_avg value = SCHED_CAPACITY_SCALE
>
> At a lower capacity, the range becomes:
>
> U * C * (1-y^r')/(1-y^p) * y^i' < Utilization < U * C * (1-y^r')/(1-y^p)
>
> with C reflecting the compute capacity ratio between current capacity and
> max capacity.
>
> So C tries to compensate for the changes in (1-y^r'), but it cannot do so
> accurately.
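
A minimal numeric sketch of the two ranges quoted above, in Python rather than kernel code. It assumes slots of about 1ms, the usual PELT half-life (y^32 = 0.5), that the idle part is i = p - r, and that at capacity ratio C the task runs r' = r / C slots. The function name `util_range` and its parameters are hypothetical.

```python
# Toy model of the util_avg range of a periodic task (not kernel code).
HALF_LIFE_SLOTS = 32                  # assumption: y**32 == 0.5
Y = 0.5 ** (1.0 / HALF_LIFE_SLOTS)
U = 1024                              # SCHED_CAPACITY_SCALE


def util_range(r, p, c=1.0):
    """Return (low, high) bounds for a task needing r slots of work at max
    capacity every p slots, on a CPU running at capacity ratio c.

    Assumptions: the task runs r' = r / c slots, idles i' = p - r' slots,
    and its contribution is scaled by c (the pre-patch behaviour).
    """
    r_run = r / c
    if r_run >= p:
        raise ValueError("task does not fit in its period at this capacity")
    idle = p - r_run
    high = U * c * (1 - Y ** r_run) / (1 - Y ** p)
    low = high * Y ** idle
    return low, high


if __name__ == "__main__":
    for ratio in (1.0, 0.5):
        low, high = util_range(4, 16, ratio)
        print(f"C={ratio}: {low:.1f} < util < {high:.1f}")
```

With r=4 and p=16 the two capacities give different bounds for the same amount of work, which is the inaccuracy described above.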
>
> Instead of scaling the contribution value of PELT algo, we should scale the
> running time. The PELT signal aims to track the amount of computation of
> tasks and/or rq so it seems more correct to scale the running time to
> reflect the effective amount of computation done since the last update.
>
> In order to be fully invariant, we need to apply the same amount of
> running time and idle time whatever the current capacity. Because running
> at lower capacity implies that the task will run longer, we have to ensure
> that the same amount of idle time will be applied when the system becomes
> idle and no idle time has been "stolen". But reaching the maximum
> utilization value (SCHED_CAPACITY_SCALE) means that the task is seen as an
> always-running task whatever the capacity of the CPU (even at max compute
> capacity). In this case, we can discard these "stolen" idle times, which
> become meaningless.
>
> In order to achieve this time scaling, a new clock_pelt is created per rq.
> The increase of this clock scales with current capacity when something
> is running on rq and synchronizes with clock_task when rq is idle. With
> this mechanism, we ensure the same running and idle time whatever the
> current capacity. This also makes it possible to simplify the PELT algorithm by
> removing all references of uarch and frequency and applying the same
> contribution to utilization and loads. Furthermore, the scaling is done
> only once per update of clock (update_rq_clock_task()) instead of during
> each update of sched_entities and cfs/rt/dl_rq of the rq like the current
> implementation. This is interesting when cgroups are involved, as shown in
> the results below:
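
The clock behaviour described above can be sketched as a toy model in Python. This is not the patch itself: the class `ToyRq` and the helper `seen_by_pelt` are hypothetical names, only the field names follow the description, and the handling of "stolen" idle time at maximum utilization is left out.

```python
class ToyRq:
    """Toy per-rq clocks; only mirrors the description, not struct rq."""

    def __init__(self):
        self.clock_task = 0.0
        self.clock_pelt = 0.0

    def advance(self, delta, running, capacity_ratio=1.0):
        """Advance wall time by delta (ms)."""
        self.clock_task += delta
        if running:
            # While something runs, clock_pelt advances at the scaled rate.
            self.clock_pelt += delta * capacity_ratio
        else:
            # When the rq is idle, clock_pelt synchronizes with clock_task.
            self.clock_pelt = self.clock_task


def seen_by_pelt(run_wall, idle_wall, capacity_ratio):
    """Return (running, idle) time as measured on clock_pelt."""
    rq = ToyRq()
    rq.advance(run_wall, running=True, capacity_ratio=capacity_ratio)
    running = rq.clock_pelt
    rq.advance(idle_wall, running=False)
    idle = rq.clock_pelt - running
    return running, idle


if __name__ == "__main__":
    # Same work in a 16ms period: 4ms at max capacity, 8ms at half capacity.
    print(seen_by_pelt(4, 12, 1.0))
    print(seen_by_pelt(8, 8, 0.5))
```

In this model both cases account for 4ms of running and 12ms of idle time, so the signal no longer depends on the capacity at which the work was done.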
>
> On a hikey (octo Arm64 platform).
> Performance cpufreq governor and only shallowest c-state to remove variance
> generated by those power features so we only track the impact of pelt algo.
>
> Each test is run 16 times.
>
> ./perf bench sched pipe
> (higher is better)
> kernel        tip/sched/core        + patch
>               ops/seconds           ops/seconds           diff
> cgroup
> root          59652(+/- 0.18%)      59876(+/- 0.24%)      +0.38%
> level1        55608(+/- 0.27%)      55923(+/- 0.24%)      +0.57%
> level2        52115(+/- 0.29%)      52564(+/- 0.22%)      +0.86%
>
> hackbench -l 1000
> (lower is better)
> kernel        tip/sched/core        + patch
>               duration(sec)         duration(sec)         diff
> cgroup
> root          4.453(+/- 2.37%)      4.383(+/- 2.88%)      -1.57%
> level1        4.859(+/- 8.50%)      4.830(+/- 7.07%)      -0.60%
> level2        5.063(+/- 9.83%)      4.928(+/- 9.66%)      -2.66%
>
> Then, the responsiveness of PELT is improved when CPU is not running at max
> capacity with this new algorithm. I have put below some examples of
> duration to reach some typical load values according to the capacity of the
> CPU with current implementation and with this patch. These values have been
> computed based on the geometric series and the half period value:
>
> Util (%)      max capacity   half capacity(mainline)   half capacity(w/ patch)
> 972 (95%)     138ms          not reachable             276ms
> 486 (47.5%)   30ms           138ms                     60ms
> 256 (25%)     13ms           32ms                      26ms
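
The durations in this table can be approximated with a short Python sketch based on the geometric series. It assumes an always-running task starting from zero utilization, a 32ms half-life, and a continuous-time form util(t) = ceiling * (1 - y^t); the function `ms_to_reach` and its flags are hypothetical. The results should match the table to within about a millisecond; the patched column appears to be twice the rounded max-capacity values.

```python
import math

HALF_LIFE_MS = 32     # assumption: the signal halves every 32ms
U_MAX = 1024          # SCHED_CAPACITY_SCALE


def ms_to_reach(target, capacity_ratio=1.0, time_scaled=False):
    """Wall-clock ms for an always-running task to reach `target` util.

    time_scaled=False models scaling the contribution (mainline): the
    signal saturates at U_MAX * capacity_ratio.
    time_scaled=True models scaling the running time (this patch): the
    signal saturates at U_MAX but PELT time runs capacity_ratio slower.
    Returns None when the target cannot be reached.
    """
    ceiling = U_MAX if time_scaled else U_MAX * capacity_ratio
    if target >= ceiling:
        return None
    pelt_ms = -HALF_LIFE_MS * math.log2(1 - target / ceiling)
    return pelt_ms / capacity_ratio if time_scaled else pelt_ms


if __name__ == "__main__":
    for util in (972, 486, 256):
        print(util,
              ms_to_reach(util),
              ms_to_reach(util, 0.5),
              ms_to_reach(util, 0.5, time_scaled=True))
```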
>
> On my hikey (octo Arm64 platform) with the schedutil governor, the time to
> reach max OPP when starting from zero utilization decreases from 223ms
> with the current scale invariance down to 121ms with the new algorithm.
>
> Signed-off-by: Vincent Guittot <vincent.guittot@xxxxxxxxxx>

Is there anything else that I should do for these patches?

Regards,
Vincent