[PATCH] sched: sync with the prev cfs when changing cgroup within a cpu

From: byungchul.park
Date: Mon Aug 10 2015 - 02:10:21 EST


From: Byungchul Park <byungchul.park@xxxxxxx>

The current code seems to handle cfs_rq->blocked_load_avg incorrectly when
changing a task's cgroup (i.e. cfs_rq) to another. I tested with
"echo pid > cgroup" and found that cfs_rq->blocked_load_avg grew larger and
larger every time I moved the task from one cgroup to another.

A task can move between groups within a single cpu, and each cfs_rq tracks
its own blocked load, so we have to sync the se's average load with both the
*prev* cfs_rq and the next cfs_rq when changing its group.
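
To make the accounting problem concrete outside the kernel, here is a minimal
standalone C sketch (a toy model with made-up names such as struct toy_cfs_rq
and move_group_buggy(); it is not kernel code). When a group move only credits
the next runqueue's blocked load and never debits the previous one, the totals
grow on every move, which matches the blocked_load_avg growth described above;
debiting the prev runqueue first keeps the sum constant.

#include <stdio.h>

/* Toy model: each "cfs_rq" only tracks a blocked-load sum. */
struct toy_cfs_rq {
	long blocked_load_avg;
};

/* Toy model of a sched entity's tracked contribution. */
struct toy_se {
	long load_avg_contrib;
};

/* Buggy move: credit the next rq but never debit the prev one. */
static void move_group_buggy(struct toy_se *se,
			     struct toy_cfs_rq *prev, struct toy_cfs_rq *next)
{
	(void)prev;	/* prev rq keeps the stale contribution */
	next->blocked_load_avg += se->load_avg_contrib;
}

/* Fixed move: debit the prev rq, then credit the next one. */
static void move_group_fixed(struct toy_se *se,
			     struct toy_cfs_rq *prev, struct toy_cfs_rq *next)
{
	prev->blocked_load_avg -= se->load_avg_contrib;
	next->blocked_load_avg += se->load_avg_contrib;
}

int main(void)
{
	struct toy_cfs_rq a = { 0 }, b = { 0 };
	struct toy_se se = { .load_avg_contrib = 100 };
	int i;

	/* The task's blocked load starts out accounted to group A. */
	a.blocked_load_avg = se.load_avg_contrib;
	for (i = 0; i < 3; i++) {
		move_group_buggy(&se, &a, &b);
		move_group_buggy(&se, &b, &a);
	}
	printf("buggy: a=%ld b=%ld (grows on every move)\n",
	       a.blocked_load_avg, b.blocked_load_avg);

	a.blocked_load_avg = se.load_avg_contrib;
	b.blocked_load_avg = 0;
	for (i = 0; i < 3; i++) {
		move_group_fixed(&se, &a, &b);
		move_group_fixed(&se, &b, &a);
	}
	printf("fixed: a=%ld b=%ld (stays at the task's contribution)\n",
	       a.blocked_load_avg, b.blocked_load_avg);
	return 0;
}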

In addition, "#ifdef CONFIG_SMP" is removed because we need to sync the se's
load with its cfs_rq even in the !SMP case; remember that moving between
groups within *a single* cpu is possible.

I also removed the comment mentioning migrate_task_rq_fair().
migrate_task_rq_fair() can be called in three cases, and in each case both
the decay counter and the blocked load are already taken care of, so we do
not need to handle them in task_move_group_fair() at all (a simplified
standalone sketch of this convention follows the three cases below).

1. The wake-up migration case:
enqueue_entity_load_avg() zeroes se->avg.decay_count after applying it, and
the task will be woken up soon, so we do not need to add its load to
cfs_rq->blocked_load_avg.

2. The fork balancing case:
se->avg.decay_count is initialized to zero in __sched_fork(), and
wake_up_new_task() calls activate_task() with flags = 0, so
enqueue_entity_load_avg() skips adding its load to
cfs_rq->blocked_load_avg; the task will be woken up soon anyway.

3. The rq migration case (not a wake-up):
The target task is already on a rq, so we do not need to consider either its
decay counter or its blocked load in this case.
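
As a rough standalone illustration of the convention the three cases above
rely on (again a toy sketch with made-up names such as toy_migrate() and
toy_enqueue(); it only mirrors my reading of the real code and is not the
real code): the migration path debits the old runqueue and leaves a
non-positive decay count as an "already handled" marker, and the enqueue
path checks that marker and skips further blocked-load bookkeeping, which is
why task_move_group_fair() has nothing left to do for a migrated task.

#include <stdio.h>

/* Toy model: a non-positive decay count means "the migration path
 * already removed this entity's blocked load". */
struct toy_se {
	long decay_count;
	long load_avg_contrib;
};

struct toy_cfs_rq {
	long blocked_load_avg;
	long decay_counter;
};

/* Stand-in for the wake-up migration path: debit the old rq and
 * leave decay_count <= 0 as the "already handled" marker. */
static void toy_migrate(struct toy_se *se, struct toy_cfs_rq *old)
{
	old->blocked_load_avg -= se->load_avg_contrib;
	se->decay_count = 0;
}

/* Stand-in for the enqueue path: if the marker is set, only resync
 * the decay reference against the destination rq and return; there
 * is no blocked load left to subtract. */
static void toy_enqueue(struct toy_se *se, struct toy_cfs_rq *dst)
{
	if (se->decay_count <= 0) {
		se->decay_count = dst->decay_counter;
		return;
	}
	/* Otherwise the entity still carries blocked load on this rq. */
	dst->blocked_load_avg -= se->load_avg_contrib;
	se->decay_count = dst->decay_counter;
}

int main(void)
{
	struct toy_cfs_rq a = { .blocked_load_avg = 100, .decay_counter = 7 };
	struct toy_cfs_rq b = { .blocked_load_avg = 0,   .decay_counter = 3 };
	struct toy_se se = { .decay_count = 7, .load_avg_contrib = 100 };

	toy_migrate(&se, &a);	/* migration path debits rq A */
	toy_enqueue(&se, &b);	/* enqueue sees the marker and skips it */
	printf("a=%ld b=%ld decay_count=%ld\n",
	       a.blocked_load_avg, b.blocked_load_avg, se.decay_count);
	return 0;
}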

Signed-off-by: Byungchul Park <byungchul.park@xxxxxxx>
---
kernel/sched/fair.c | 19 ++++++++++++-------
1 file changed, 12 insertions(+), 7 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index ffa70dc..f8ab2ea 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8229,22 +8229,27 @@ static void task_move_group_fair(struct task_struct *p, int queued)
 	if (!queued && (!se->sum_exec_runtime || p->state == TASK_WAKING))
 		queued = 1;

-	if (!queued)
-		se->vruntime -= cfs_rq_of(se)->min_vruntime;
+	if (!queued) {
+		cfs_rq = cfs_rq_of(se);
+		se->vruntime -= cfs_rq->min_vruntime;
+
+		/*
+		 * we must synchronize with the prev cfs.
+		 */
+		__synchronize_entity_decay(se);
+		subtract_blocked_load_contrib(cfs_rq, se->avg.load_avg_contrib);
+	}
 	set_task_rq(p, task_cpu(p));
 	se->depth = se->parent ? se->parent->depth + 1 : 0;
 	if (!queued) {
 		cfs_rq = cfs_rq_of(se);
 		se->vruntime += cfs_rq->min_vruntime;
-#ifdef CONFIG_SMP
+
 		/*
-		 * migrate_task_rq_fair() will have removed our previous
-		 * contribution, but we must synchronize for ongoing future
-		 * decay.
+		 * we must synchronize with the next cfs for ongoing future decay.
 		 */
 		se->avg.decay_count = atomic64_read(&cfs_rq->decay_counter);
 		cfs_rq->blocked_load_avg += se->avg.load_avg_contrib;
-#endif
 	}
 }

--
1.7.9.5
