Re: [PATCH v2 for-4.12-fixes 2/2] sched/fair: Fix O(# total cgroups) in load balance path

From: Vincent Guittot
Date: Wed May 10 2017 - 02:50:40 EST


Hi Tejun,

On 9 May 2017 at 18:18, Tejun Heo <tj@xxxxxxxxxx> wrote:
> Currently, rq->leaf_cfs_rq_list is a traversal-ordered list of all
> live cfs_rqs which have ever been active on the CPU; unfortunately,
> this makes update_blocked_averages() O(# total cgroups), which isn't
> scalable at all.

Dietmar raised a similar optimization in the past. The only question
was: what is the impact of re-adding the cfs_rq to leaf_cfs_rq_list
on the wake-up path? Have you done any measurements?
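
For reference, the extra cost I have in mind is on the enqueue side:
once a cfs_rq has been taken off the list, it has to go back through
list_add_leaf_cfs_rq() the next time it gets a task. Roughly (a
simplified sketch from memory, not the exact 4.12 code):

	/*
	 * enqueue_entity(), simplified: when the cfs_rq receives its
	 * first task again, it must be re-linked on
	 * rq->leaf_cfs_rq_list, so every wake up of a previously idle
	 * group now goes through the on_list check and possibly the
	 * list insertion again.
	 */
	if (cfs_rq->nr_running == 1) {
		list_add_leaf_cfs_rq(cfs_rq);
		check_enqueue_throttle(cfs_rq);
	}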

>
> This shows up as a small increase in CPU consumption and scheduling
> latency in the load balancing path on systems with the CPU controller
> enabled across most cgroups. In an edge case where temporary cgroups
> were leaking, this caused the kernel to consume several tens of
> percent of CPU cycles running update_blocked_averages(), each run
> taking multiple milliseconds.
>
> This patch fixes the issue by taking empty and fully decayed cfs_rqs
> off the rq->leaf_cfs_rq_list.
>
> Signed-off-by: Tejun Heo <tj@xxxxxxxxxx>
> Cc: Ingo Molnar <mingo@xxxxxxxxxx>
> Cc: Peter Zijlstra <peterz@xxxxxxxxxxxxx>
> Cc: Mike Galbraith <efault@xxxxxx>
> Cc: Paul Turner <pjt@xxxxxxxxxx>
> Cc: Chris Mason <clm@xxxxxx>
> Cc: stable@xxxxxxxxxxxxxxx
> ---
> Just refreshed on top of the first patch.
>
> kernel/sched/fair.c | 19 ++++++++++++++-----
> 1 file changed, 14 insertions(+), 5 deletions(-)
>
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -369,8 +369,9 @@ static inline void list_del_leaf_cfs_rq(
> }
>
> /* Iterate thr' all leaf cfs_rq's on a runqueue */
> -#define for_each_leaf_cfs_rq(rq, cfs_rq) \
> - list_for_each_entry_rcu(cfs_rq, &rq->leaf_cfs_rq_list, leaf_cfs_rq_list)
> +#define for_each_leaf_cfs_rq_safe(rq, cfs_rq, pos) \
> + list_for_each_entry_safe(cfs_rq, pos, &rq->leaf_cfs_rq_list, \
> + leaf_cfs_rq_list)
>
> /* Do the two (enqueued) entities belong to the same group ? */
> static inline struct cfs_rq *
> @@ -463,7 +464,7 @@ static inline void list_del_leaf_cfs_rq(
> {
> }
>
> -#define for_each_leaf_cfs_rq(rq, cfs_rq) \
> +#define for_each_leaf_cfs_rq_safe(rq, cfs_rq, pos) \
> for (cfs_rq = &rq->cfs; cfs_rq; cfs_rq = NULL)
>
> static inline struct sched_entity *parent_entity(struct sched_entity *se)
> @@ -6984,7 +6985,7 @@ static void attach_tasks(struct lb_env *
> static void update_blocked_averages(int cpu)
> {
> struct rq *rq = cpu_rq(cpu);
> - struct cfs_rq *cfs_rq;
> + struct cfs_rq *cfs_rq, *pos;
> struct rq_flags rf;
>
> rq_lock_irqsave(rq, &rf);
> @@ -6994,7 +6995,7 @@ static void update_blocked_averages(int
> * Iterates the task_group tree in a bottom up fashion, see
> * list_add_leaf_cfs_rq() for details.
> */
> - for_each_leaf_cfs_rq(rq, cfs_rq) {
> + for_each_leaf_cfs_rq_safe(rq, cfs_rq, pos) {
> struct sched_entity *se;
>
> /* throttled entities do not contribute to load */
> @@ -7008,6 +7009,14 @@ static void update_blocked_averages(int
> se = cfs_rq->tg->se[cpu];
> if (se && !skip_blocked_update(se))
> update_load_avg(se, 0);
> +
> + /*
> + * There can be a lot of idle CPU cgroups. Don't let fully
> + * decayed cfs_rqs linger on the list.
> + */
> + if (!cfs_rq->load.weight && !cfs_rq->avg.load_sum &&
> + !cfs_rq->avg.util_sum && !cfs_rq->runnable_load_sum)
> + list_del_leaf_cfs_rq(cfs_rq);

list_add_leaf_cfs_rq() assumes that we always enqueue cfs_rqs bottom-up.
By removing a cfs_rq, can't we break this assumption in some cases?
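
To illustrate why the ordering matters (a userspace toy model I made
up for this mail, not kernel code; the toy_cfs_rq names are invented):
a single pass like the one in update_blocked_averages() can only
propagate a child's contribution to its parent in the same pass if the
child appears before the parent on the list, so a re-add that ends up
on the wrong side of its parent delays the propagation to a later pass:

	#include <stdio.h>

	struct toy_cfs_rq {
		const char *name;
		struct toy_cfs_rq *parent;
		long load;	/* load generated locally */
		long contrib;	/* load aggregated from children */
	};

	/* one update_blocked_averages()-like pass, in list order */
	static void update_pass(struct toy_cfs_rq **list, int n)
	{
		int i;

		for (i = 0; i < n; i++) {
			struct toy_cfs_rq *cfs_rq = list[i];

			if (cfs_rq->parent)
				cfs_rq->parent->contrib +=
					cfs_rq->load + cfs_rq->contrib;
		}
	}

	int main(void)
	{
		struct toy_cfs_rq root   = { "root",   NULL,    0,   0 };
		struct toy_cfs_rq parent = { "parent", &root,   0,   0 };
		struct toy_cfs_rq child  = { "child",  &parent, 100, 0 };

		/* bottom-up ordering: child before parent before root */
		struct toy_cfs_rq *good[] = { &child, &parent, &root };
		update_pass(good, 3);
		printf("good order: root sees %ld\n", root.contrib);	/* 100 */

		root.contrib = parent.contrib = 0;

		/* hypothetical ordering after a delete + re-add: parent first */
		struct toy_cfs_rq *bad[] = { &parent, &root, &child };
		update_pass(bad, 3);
		printf("bad order:  root sees %ld\n", root.contrib);	/* 0 */

		return 0;
	}

Maybe this can't happen in practice because a removed cfs_rq is fully
decayed anyway, but it would be good to double check how the re-add
interacts with the ordering that list_add_leaf_cfs_rq() tries to
maintain.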

Regards,
Vincent

> }
> rq_unlock_irqrestore(rq, &rf);
> }