Re: [PATCH v2] sched/cfs: make util/load_avg more stable

From: Vincent Guittot
Date: Tue Apr 25 2017 - 08:40:34 EST


On 25 April 2017 at 13:05, Dietmar Eggemann <dietmar.eggemann@xxxxxxx> wrote:
> On 19/04/17 17:54, Vincent Guittot wrote:
>> In the current implementation of load/util_avg, we assume that the ongoing
>> time segment has fully elapsed, and util/load_sum is divided by LOAD_AVG_MAX,
>> even if part of the time segment still remains to run. As a consequence, this
>> remaining part is considered as idle time and generates unexpected variations
>> of util_avg of a busy CPU in the range ]1002..1024[ whereas util_avg should
>
> Why do you use the square brackets the other way around? Just curious.

This refers to the values at the very beginning and the very end of a time
segment, per the formulas below.
That being said, 1024 is not reachable because at the very end of a segment
we have:
(1024*LOAD_AVG_MAX*y + 1024*1023) / LOAD_AVG_MAX = 1023.9997

1002 is not reachable because at the very beginning of a segment we have:
(1024*LOAD_AVG_MAX*y + 1024*0) / LOAD_AVG_MAX = 1002.0577

But since we are working with integers, [1002..1024[ is probably more correct
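
For reference, a quick userspace check of those two bounds (illustration
only, not kernel code; it assumes LOAD_AVG_MAX = 47742 and y^32 = 0.5, so
the decimals can differ slightly from the integer math in fair.c):

#include <math.h>
#include <stdio.h>

int main(void)
{
	const double load_avg_max = 47742.0;   /* LOAD_AVG_MAX */
	const double y = pow(0.5, 1.0 / 32.0); /* decay factor: y^32 = 0.5 */

	/* very beginning of a segment: only the decayed history remains */
	double beg = (1024.0 * load_avg_max * y + 1024.0 * 0.0) / load_avg_max;

	/* very end of a segment: decayed history + up to 1023us of running */
	double end = (1024.0 * load_avg_max * y + 1024.0 * 1023.0) / load_avg_max;

	/* prints roughly 1002.06 and 1023.99 */
	printf("begin ~ %.4f  end ~ %.4f\n", beg, end);
	return 0;
}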

>
> 1002 stands for 1024*y^1 w/ y = 4008/4096 or y^32 = 0.5, right ? Might
> be worth mentioning.
>
>> stay at 1023.
>>
>> In order to keep the metric stable, we should not consider the ongoing time
>> segment when computing load/util_avg but only the segments that have already
>> fully elapsed. But not considering the current time segment adds unwanted
>> latency in the load/util_avg responsiveness, especially when the time is scaled
>> instead of the contribution. Instead of waiting for the current time segment
>> to have fully elapsed before accounting it in load/util_avg, we can already
>> account the elapsed part but change the range used to compute load/util_avg
>> accordingly.
>>
>> At the very beginning of a new time segment, the past segments have been
>> decayed and the max value is LOAD_AVG_MAX*y. At the very end of the current
>> time segment, the max value becomes 1024 (us) + LOAD_AVG_MAX*y, which is equal
>> to LOAD_AVG_MAX. In fact, the max value is
>> sa->period_contrib + LOAD_AVG_MAX*y at any time in the time segment.
>>
>> Taking advantage of the fact that LOAD_AVG_MAX*y == LOAD_AVG_MAX - 1024, the
>> range becomes [0..LOAD_AVG_MAX-1024+sa->period_contrib].
>>
>> As the elapsed part is already accounted in load/util_sum, we update the max
>> value according to the current position in the time segment instead of
>> removing its contribution.
>
> Removing its contribution stands for '- 1024' of 'LOAD_AVG_MAX - 1024'
> which was added in patch 1/2?

Removing its contribution refers to the "- sa->period_contrib * weight"
and "- (running * sa->period_contrib << SCHED_CAPACITY_SHIFT))" terms in
patch 1/2 of the previous version.
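
To make the v2 approach concrete, a sketch of the util case only (made-up
helper names, not the actual kernel code, which also updates the load and
runnable sums with div_u64()):

#define LOAD_AVG_MAX	47742

/* mainline: the remainder of the ongoing segment is counted as idle */
static unsigned long util_avg_old(unsigned long util_sum)
{
	return util_sum / LOAD_AVG_MAX;
}

/*
 * this patch: shrink the divisor to what could have been accumulated so
 * far, i.e. the decayed history (LOAD_AVG_MAX - 1024) plus the elapsed
 * part of the current segment
 */
static unsigned long util_avg_new(unsigned long util_sum,
				  unsigned int period_contrib)
{
	return util_sum / (LOAD_AVG_MAX - 1024 + period_contrib);
}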

>
>>
>> Suggested-by: Peter Zijlstra <peterz@xxxxxxxxxxxxx>
>> Signed-off-by: Vincent Guittot <vincent.guittot@xxxxxxxxxx>
>> ---
>>
>> Fold both patches into one
>>
>> kernel/sched/fair.c | 6 +++---
>> 1 file changed, 3 insertions(+), 3 deletions(-)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 3f83a35..c3b8f0f 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -3017,12 +3017,12 @@ ___update_load_avg(u64 now, int cpu, struct sched_avg *sa,
>>  	/*
>>  	 * Step 2: update *_avg.
>>  	 */
>> -	sa->load_avg = div_u64(sa->load_sum, LOAD_AVG_MAX);
>> +	sa->load_avg = div_u64(sa->load_sum, LOAD_AVG_MAX - 1024 + sa->period_contrib);
>>  	if (cfs_rq) {
>>  		cfs_rq->runnable_load_avg =
>> -			div_u64(cfs_rq->runnable_load_sum, LOAD_AVG_MAX);
>> +			div_u64(cfs_rq->runnable_load_sum, LOAD_AVG_MAX - 1024 + sa->period_contrib);
>>  	}
>> -	sa->util_avg = sa->util_sum / LOAD_AVG_MAX;
>> +	sa->util_avg = sa->util_sum / (LOAD_AVG_MAX - 1024 + sa->period_contrib);
>>
>>  	return 1;
>>  }
>>
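
For completeness, with LOAD_AVG_MAX = 47742 the new divisor tracks the
position in the current segment:
- right after a rollover, sa->period_contrib is ~0 and the divisor is
  ~47742 - 1024 = 46718, i.e. just the decayed history LOAD_AVG_MAX*y;
- just before the next rollover, sa->period_contrib is ~1023 and the
  divisor is ~47741, i.e. nearly the full LOAD_AVG_MAX;
so util_sum and the divisor stay in step and the util_avg of a busy CPU no
longer drops toward 1002 at the start of each segment.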