Re: [PATCH 2/2] sched/fair: Always propagate runnable_load_avg

From: Vincent Guittot
Date: Tue Apr 25 2017 - 05:06:04 EST


On 25 April 2017 at 10:46, Vincent Guittot <vincent.guittot@xxxxxxxxxx> wrote:
> On 24 April 2017 at 22:14, Tejun Heo <tj@xxxxxxxxxx> wrote:
>> We noticed that with cgroup CPU controller in use, the scheduling
>> latency gets wonky regardless of nesting level or weight
>> configuration. This is easily reproducible with Chris Mason's
>> schbench[1].
>>
>> All tests are run on a single socket, 16 cores, 32 threads machine.
>> While the machine is mostly idle, it isn't completey. There's always
>> some variable management load going on. The command used is
>>
>> schbench -m 2 -t 16 -s 10000 -c 15000 -r 30
>>
>> which measures scheduling latency with 32 threads each of which
>> repeatedly runs for 15ms and then sleeps for 10ms. Here's a
>> representative result when running from the root cgroup.
>>
>> # ~/schbench -m 2 -t 16 -s 10000 -c 15000 -r 30
>> Latency percentiles (usec)
>> 50.0000th: 26
>> 75.0000th: 62
>> 90.0000th: 74
>> 95.0000th: 86
>> *99.0000th: 887
>> 99.5000th: 3692
>> 99.9000th: 10832
>> min=0, max=13374
>>
>> The following is inside a first level CPU cgroup with the maximum
>> weight.
>>
>> # ~/schbench -m 2 -t 16 -s 10000 -c 15000 -r 30
>> Latency percentiles (usec)
>> 50.0000th: 31
>> 75.0000th: 65
>> 90.0000th: 71
>> 95.0000th: 91
>> *99.0000th: 7288
>> 99.5000th: 10352
>> 99.9000th: 12496
>> min=0, max=13023
>>
>> Note the drastic increase in p99 scheduling latency. After
>> investigation, it turned out that the update_sd_lb_stats(), which is
>> used by load_balance() to pick the most loaded group, was often
>> picking the wrong group. A CPU which has one schbench running and
>> another queued wouldn't report the correspondingly higher
>> weighted_cpuload() and get looked over as the target of load
>> balancing.
>>
>> weighted_cpuload() is the root cfs_rq's runnable_load_avg which is the
>> sum of the load_avg of all queued sched_entities. Without cgroups or
>> at the root cgroup, each task's load_avg contributes directly to the
>> sum. When a task wakes up or goes to sleep, the change is immediately
>> reflected on runnable_load_avg which in turn affects load balancing.
>>
>> However, when CPU cgroup is in use, a nesting cfs_rq blocks this
>> immediate reflection. When a task wakes up inside a cgroup, the
>> nested cfs_rq's runnable_load_avg is updated but doesn't get
>> propagated to the parent. It only affects the matching sched_entity's
>> load_avg over time which then gets propagated to the runnable_load_avg
>> at that level. This makes weighted_cpuload() often temporarily out of
>> sync leading to suboptimal behavior of load_balance() and increase in
>> scheduling latencies as shown above.
>>
>> This patch fixes the issue by updating propagate_entity_load_avg() to
>> always propagate to the parent's runnable_load_avg. Combined with the
>> previous patch, this keeps a cfs_rq's runnable_load_avg always the sum
>> of the scaled loads of all tasks queued below removing the artifacts
>> from nesting cfs_rqs. The following is from inside three levels of
>> nesting with the patch applied.
>
> So you are changing the purpose of propagate_entity_load_avg which
> aims to propagate load_avg/util_avg changes only when a task migrate
> and you also want to propagate the enqueue/dequeue in the parent
> cfs_rq->runnable_load_avg

In fact you want that sched_entity load_avg reflects
cfs_rq->runnable_load_avg and not cfs_rq->avg.load_avg

>
>>
>> # ~/schbench -m 2 -t 16 -s 10000 -c 15000 -r 30
>> Latency percentiles (usec)
>> 50.0000th: 40
>> 75.0000th: 71
>> 90.0000th: 89
>> 95.0000th: 108
>> *99.0000th: 679
>> 99.5000th: 3500
>> 99.9000th: 10960
>> min=0, max=13790
>>
>> [1] git://git.kernel.org/pub/scm/linux/kernel/git/mason/schbench.git
>>
>> Signed-off-by: Tejun Heo <tj@xxxxxxxxxx>
>> Cc: Vincent Guittot <vincent.guittot@xxxxxxxxxx>
>> Cc: Ingo Molnar <mingo@xxxxxxxxxx>
>> Cc: Peter Zijlstra <peterz@xxxxxxxxxxxxx>
>> Cc: Mike Galbraith <efault@xxxxxx>
>> Cc: Paul Turner <pjt@xxxxxxxxxx>
>> Cc: Chris Mason <clm@xxxxxx>
>> ---
>> kernel/sched/fair.c | 34 +++++++++++++++++++++-------------
>> 1 file changed, 21 insertions(+), 13 deletions(-)
>>
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -3075,7 +3075,8 @@ update_tg_cfs_util(struct cfs_rq *cfs_rq
>>
>> /* Take into account change of load of a child task group */
>> static inline void
>> -update_tg_cfs_load(struct cfs_rq *cfs_rq, struct sched_entity *se)
>> +update_tg_cfs_load(struct cfs_rq *cfs_rq, struct sched_entity *se,
>> + bool propagate_avg)
>> {
>> struct cfs_rq *gcfs_rq = group_cfs_rq(se);
>> long load = 0, delta;
>> @@ -3113,9 +3114,11 @@ update_tg_cfs_load(struct cfs_rq *cfs_rq
>> se->avg.load_avg = load;
>> se->avg.load_sum = se->avg.load_avg * LOAD_AVG_MAX;
>>
>> - /* Update parent cfs_rq load */
>> - add_positive(&cfs_rq->avg.load_avg, delta);
>> - cfs_rq->avg.load_sum = cfs_rq->avg.load_avg * LOAD_AVG_MAX;
>> + if (propagate_avg) {
>> + /* Update parent cfs_rq load */
>> + add_positive(&cfs_rq->avg.load_avg, delta);
>> + cfs_rq->avg.load_sum = cfs_rq->avg.load_avg * LOAD_AVG_MAX;
>> + }
>>
>> /*
>> * If the sched_entity is already enqueued, we also have to update the
>> @@ -3147,22 +3150,27 @@ static inline int test_and_clear_tg_cfs_
>> /* Update task and its cfs_rq load average */
>> static inline int propagate_entity_load_avg(struct sched_entity *se)
>> {
>> - struct cfs_rq *cfs_rq;
>> + struct cfs_rq *cfs_rq = cfs_rq_of(se);
>> + bool propagate_avg;
>>
>> if (entity_is_task(se))
>> return 0;
>>
>> - if (!test_and_clear_tg_cfs_propagate(se))
>> - return 0;
>> -
>> - cfs_rq = cfs_rq_of(se);
>> + propagate_avg = test_and_clear_tg_cfs_propagate(se);
>>
>> - set_tg_cfs_propagate(cfs_rq);
>> + /*
>> + * We want to keep @cfs_rq->runnable_load_avg always in sync so
>> + * that the load balancer can accurately determine the busiest CPU
>> + * regardless of cfs_rq nesting.
>> + */
>> + update_tg_cfs_load(cfs_rq, se, propagate_avg);
>>
>> - update_tg_cfs_util(cfs_rq, se);
>> - update_tg_cfs_load(cfs_rq, se);
>> + if (propagate_avg) {
>> + set_tg_cfs_propagate(cfs_rq);
>> + update_tg_cfs_util(cfs_rq, se);
>> + }
>>
>> - return 1;
>> + return propagate_avg;
>> }
>>
>> #else /* CONFIG_FAIR_GROUP_SCHED */