[RFC] sched/fair: Align vruntime to last_se when curr_se's timeslice run out

From: weiqi (C)
Date: Wed Dec 05 2018 - 07:41:44 EST


Hello,
I measuring the cfs scheduling delay, and found that
when a cfs_rq contains multiple tasks, it does not guarantee
every se of this cfs_rq can get a changce to run within
one sched_latency period. Assume that there are three se
A, B, and C in cfs_rq, and the default sched_latency=24ms.
In ideal state, A, B, and C take turns to get 8ms, The timing is:
A(8ms) -> B(8ms) -> C(8ms) -> A(8ms) -> B(8ms) -> C(8ms) ....
So, the delay for each task is approximately 16ms.

However, check_preempt_tick() is the clock interrupt granularity,
and the clock interrupt may also have an indeterminate fine delay,
so the time slice actually expires relative
to the ideal expiration time delay uncertain time.
This may result in a timing of :
A -> B -> C -> B -> C -> A
which increases the scheduling delay of A to 32ms.


The more extreme scenario is wakeup sleepers. If a sleeper task
wakes up to a cpu, its vruntime will be sched_latency/2 less
than cfs_rq->min_vruntime /* place_entity() function do this */
In the scene of the above A, B, C cycle time slice,
a sleeper thread D woken up
D->vruntime = cfs_rq->min_vruntime - sched_latency/2
then the timing of the pick_task may be:

A(8ms)->B(8ms)->C(8ms)->D(6ms)->D(6ms)->D(6ms,then dequeue)->C(8ms)-> B(8ms)->A

causing A's scheduling delay to reach 50ms
so, I make changes to check_preempt_tick()ï

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index ee271bb..1f61b9c 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4020,7 +4020,23 @@ static void clear_buddies(struct cfs_rq *cfs_rq, struct sched_entity *se)
ideal_runtime = sched_slice(cfs_rq, curr);
delta_exec = curr->sum_exec_runtime - curr->prev_sum_exec_runtime;
if (delta_exec > ideal_runtime) {
+ struct rb_node *next = NULL;
+ struct rb_node *right_most = NULL;
+ struct sched_entity *last;
resched_curr(rq_of(cfs_rq));
+
+ /* always set to max vruntime */
+ if (cfs_rq->nr_running > 1) {
+ next = &curr->run_node;
+ do {
+ right_most = next;
+ next = rb_next(next);
+ } while (next);
+
+ last = rb_entry(right_most,
+ struct sched_entity, run_node);
+ curr->vruntime = last->vruntime + 1; // maybe +1 is not needed
+ }
+
/*
* The current task ran long enough, ensure it doesn't get
* re-elected due to buddy favours.


In this way, time sliced task in one sched_latency period will not get runtime again,
And for wake sleeper, it can meet the needs of wake-up preemption, and this will not let it
take too much time in a scheduling period.

I wrote a small press program, bind it to 12 cpus, this program wake up 36 sub-processes
every second and each sub-process work 120ms, looping all the time..
Iâve tested on linux-stable 4.20-rc1 kernel, the maximum delay of press program is about 80ms.
After applying the above modifications, the maximum delay is stable at 27ms.