Re: scheduler nice 19 versus 'idle' behavior / static low-priority scheduling

From: Mike Galbraith
Date: Sat Jan 31 2009 - 04:08:47 EST


On Sat, 2009-01-31 at 06:38 +0100, Mike Galbraith wrote:
> On Fri, 2009-01-30 at 14:12 -0800, Brian Rogers wrote:
> > Mike Galbraith wrote:
> > > On Fri, 2009-01-30 at 02:59 -0500, Nathanael Hoyle wrote:
> > >
> > >> I am running foldingathome under it at the moment, and it seems to be
> > >> improving the situation somewhat, but I still need/want to test with
> > >> Mike's referenced patches.
> > >>
> > > You will most definitely encounter evilness running SCHED_IDLE tasks in
> > > a kernel without the SCHED_IDLE fixes.
> > >
> > Speaking of SCHED_IDLE fixes, is
> > 6bc912b71b6f33b041cfde93ca3f019cbaa852bc going to be put into the next
> > stable 2.6.28 release? Without it on 2.6.28.2, I can still produce
> > minutes-long freezes with BOINC or other idle processes.
> >
> > With the above commit on top of 2.6.28.2 and also
> > cce7ade803699463ecc62a065ca522004f7ccb3d, the problem is solved, though
> > I assume cce7ad isn't actually required to fix that, and I can test that
> > if desired.
>
> I think they both should go to stable, but dunno if they're headed that
> direction or not.
>
> One way to find out, CCs added.

For those who may want to run SCHED_IDLE tasks in .27, I've integrated
and lightly tested the fixes required to do so. One additional commit
was needed to get SCHED_IDLE vs nice 19 working right, namely f9c0b09.
Without that, SCHED_IDLE tasks received more CPU than nice 19 tasks.

Since .27 is in long-term maintenance, I'd integrate these into stable,
but that's not my decision. Anyone who applies the below to their stable
kernel gets to keep all the pieces should something break ;-)
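
For completeness, putting a task under SCHED_IDLE from userspace is just
sched_setscheduler() with a sched_priority of 0; recent util-linux chrt
should be able to do the same. A minimal wrapper along these lines
(purely illustrative, error handling kept to the bare minimum):

#define _GNU_SOURCE		/* for SCHED_IDLE in glibc's <sched.h> */
#include <sched.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	struct sched_param sp;

	if (argc < 2) {
		fprintf(stderr, "usage: %s command [args...]\n", argv[0]);
		return 1;
	}

	/* sched_priority must be 0 for SCHED_IDLE */
	memset(&sp, 0, sizeof(sp));

	if (sched_setscheduler(0, SCHED_IDLE, &sp) == -1) {
		perror("sched_setscheduler");
		return 1;
	}

	/* the policy survives exec, so the workload runs as SCHED_IDLE */
	execvp(argv[1], &argv[1]);
	perror("execvp");
	return 1;
}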

commit f9c0b0950d5fd8c8c5af39bc061f27ea8fddcac3
Author: Peter Zijlstra <a.p.zijlstra@xxxxxxxxx>
Date: Fri Oct 17 19:27:04 2008 +0200

sched: revert back to per-rq vruntime

Vatsa rightly points out that having the runqueue weight in the vruntime
calculations can cause unfairness in the face of task joins/leaves.

Suppose: dv = dt * rw / w

Then take 10 tasks t_n, each of similar weight. If the first one runs
for 1 unit of time, its vruntime will increase by 10. Now, if the next
8 tasks leave after each having run their 1 unit, the last task will
see a vruntime increase of only 2 after having run its 1 unit.

That leaves us with 2 tasks of equal weight and equal runtime, one of
which will not be scheduled for 8/2 = 4 units of time.

Ergo, we cannot do that and must use: dv = dt / w.

This means we cannot have a global vruntime based on effective priority, but
must instead go back to the vruntime per rq model we started out with.

This patch was lightly tested by starting while loops at each nice level
and observing their execution times, and with a simple 1:2:3 group
scenario pinned to a single cpu.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@xxxxxxxxx>
Signed-off-by: Ingo Molnar <mingo@xxxxxxx>
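
To make the arithmetic above concrete, here is a throwaway userspace
snippet (nothing to do with the actual patch) that just replays the
10-task example and prints the numbers:

#include <stdio.h>

int main(void)
{
	const double w = 1024.0;	/* one task's weight, value irrelevant */
	double rw, v_first, v_last;

	/* the first task runs 1 unit of wall time while all 10 are queued */
	rw = 10 * w;
	v_first = 1.0 * rw / w;		/* dv = dt * rw / w = 10 */

	/* 8 tasks run their unit and leave; when the last task finally
	 * runs its unit, only 2 tasks remain on the queue */
	rw = 2 * w;
	v_last = 1.0 * rw / w;		/* dv = 2 */

	/* equal weight, equal runtime, yet an 8 unit vruntime gap; at
	 * dv/dt = rw/w = 2 it takes 8/2 = 4 wall time units to close */
	printf("v_first=%.0f v_last=%.0f catch-up=%.0f\n",
	       v_first, v_last, (v_first - v_last) / (rw / w));
	return 0;
}

This prints v_first=10 v_last=2 catch-up=4, i.e. the 8/2 = 4 units of
unfairness described above; with dv = dt / w both tasks would simply end
up with the same vruntime.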

---
kernel/sched_fair.c | 32 +++++++++++++++-----------------
1 file changed, 15 insertions(+), 17 deletions(-)

Index: linux-2.6.27/kernel/sched_fair.c
===================================================================
--- linux-2.6.27.orig/kernel/sched_fair.c
+++ linux-2.6.27/kernel/sched_fair.c
@@ -334,7 +334,7 @@ int sched_nr_latency_handler(struct ctl_
#endif

/*
- * delta *= w / rw
+ * delta *= P[w / rw]
*/
static inline unsigned long
calc_delta_weight(unsigned long delta, struct sched_entity *se)
@@ -348,15 +348,13 @@ calc_delta_weight(unsigned long delta, s
}

/*
- * delta *= rw / w
+ * delta /= w
*/
static inline unsigned long
calc_delta_fair(unsigned long delta, struct sched_entity *se)
{
- for_each_sched_entity(se) {
- delta = calc_delta_mine(delta,
- cfs_rq_of(se)->load.weight, &se->load);
- }
+ if (unlikely(se->load.weight != NICE_0_LOAD))
+ delta = calc_delta_mine(delta, NICE_0_LOAD, &se->load);

return delta;
}
@@ -386,26 +384,26 @@ static u64 __sched_period(unsigned long
* We calculate the wall-time slice from the period by taking a part
* proportional to the weight.
*
- * s = p*w/rw
+ * s = p*P[w/rw]
*/
static u64 sched_slice(struct cfs_rq *cfs_rq, struct sched_entity *se)
{
- return calc_delta_weight(__sched_period(cfs_rq->nr_running), se);
+ unsigned long nr_running = cfs_rq->nr_running;
+
+ if (unlikely(!se->on_rq))
+ nr_running++;
+
+ return calc_delta_weight(__sched_period(nr_running), se);
}

/*
* We calculate the vruntime slice of a to be inserted task
*
- * vs = s*rw/w = p
+ * vs = s/w
*/
-static u64 sched_vslice_add(struct cfs_rq *cfs_rq, struct sched_entity *se)
+static u64 sched_vslice(struct cfs_rq *cfs_rq, struct sched_entity *se)
{
- unsigned long nr_running = cfs_rq->nr_running;
-
- if (!se->on_rq)
- nr_running++;
-
- return __sched_period(nr_running);
+ return calc_delta_fair(sched_slice(cfs_rq, se), se);
}

/*
@@ -683,7 +681,7 @@ place_entity(struct cfs_rq *cfs_rq, stru
* stays open at the end.
*/
if (initial && sched_feat(START_DEBIT))
- vruntime += sched_vslice_add(cfs_rq, se);
+ vruntime += sched_vslice(cfs_rq, se);

if (!initial) {
/* sleeps upto a single latency don't count. */
commit 1af5f730fc1bf7c62ec9fb2d307206e18bf40a69
Author: Peter Zijlstra <a.p.zijlstra@xxxxxxxxx>
Date: Fri Oct 24 11:06:13 2008 +0200

sched: more accurate min_vruntime accounting

Mike noticed that the current min_vruntime tracking can go wrong and
skip the current task. If the only remaining task in the tree is a
nice 19 task with a huge vruntime, new tasks will also be inserted too
far to the right, causing interactivity issues.

min_vruntime can only change due to the leftmost entry disappearing
(dequeue_entity()), or due to the leftmost entry being incremented past
the next entry, which elects a new leftmost (__update_curr()).

Due to the current entry not being part of the actual tree, we have to
compare the leftmost tree entry with the current entry, and take the
leftmost of these two.

So create an update_min_vruntime() function that computes the leftmost
vruntime in the system (either the tree's leftmost or current) and
increases cfs_rq->min_vruntime if the computed value is larger than the
previously found min_vruntime. Call it from the two sites identified
above that can change min_vruntime.

Reported-by: Mike Galbraith <efault@xxxxxx>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@xxxxxxxxx>
Acked-by: Mike Galbraith <efault@xxxxxx>
Signed-off-by: Ingo Molnar <mingo@xxxxxxx>

---
kernel/sched_fair.c | 49 +++++++++++++++++++++++++------------------------
1 file changed, 25 insertions(+), 24 deletions(-)

Index: linux-2.6.27/kernel/sched_fair.c
===================================================================
--- linux-2.6.27.orig/kernel/sched_fair.c
+++ linux-2.6.27/kernel/sched_fair.c
@@ -221,6 +221,27 @@ static inline s64 entity_key(struct cfs_
return se->vruntime - cfs_rq->min_vruntime;
}

+static void update_min_vruntime(struct cfs_rq *cfs_rq)
+{
+ u64 vruntime = cfs_rq->min_vruntime;
+
+ if (cfs_rq->curr)
+ vruntime = cfs_rq->curr->vruntime;
+
+ if (cfs_rq->rb_leftmost) {
+ struct sched_entity *se = rb_entry(cfs_rq->rb_leftmost,
+ struct sched_entity,
+ run_node);
+
+ if (vruntime == cfs_rq->min_vruntime)
+ vruntime = se->vruntime;
+ else
+ vruntime = min_vruntime(vruntime, se->vruntime);
+ }
+
+ cfs_rq->min_vruntime = max_vruntime(cfs_rq->min_vruntime, vruntime);
+}
+
/*
* Enqueue an entity into the rb-tree:
*/
@@ -254,15 +275,8 @@ static void __enqueue_entity(struct cfs_
* Maintain a cache of leftmost tree entries (it is frequently
* used):
*/
- if (leftmost) {
+ if (leftmost)
cfs_rq->rb_leftmost = &se->run_node;
- /*
- * maintain cfs_rq->min_vruntime to be a monotonic increasing
- * value tracking the leftmost vruntime in the tree.
- */
- cfs_rq->min_vruntime =
- max_vruntime(cfs_rq->min_vruntime, se->vruntime);
- }

rb_link_node(&se->run_node, parent, link);
rb_insert_color(&se->run_node, &cfs_rq->tasks_timeline);
@@ -272,18 +286,9 @@ static void __dequeue_entity(struct cfs_
{
if (cfs_rq->rb_leftmost == &se->run_node) {
struct rb_node *next_node;
- struct sched_entity *next;

next_node = rb_next(&se->run_node);
cfs_rq->rb_leftmost = next_node;
-
- if (next_node) {
- next = rb_entry(next_node,
- struct sched_entity, run_node);
- cfs_rq->min_vruntime =
- max_vruntime(cfs_rq->min_vruntime,
- next->vruntime);
- }
}

if (cfs_rq->next == se)
@@ -480,6 +485,7 @@ __update_curr(struct cfs_rq *cfs_rq, str
schedstat_add(cfs_rq, exec_clock, delta_exec);
delta_exec_weighted = calc_delta_fair(delta_exec, curr);
curr->vruntime += delta_exec_weighted;
+ update_min_vruntime(cfs_rq);
}

static void update_curr(struct cfs_rq *cfs_rq)
@@ -666,13 +672,7 @@ static void check_spread(struct cfs_rq *
static void
place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int initial)
{
- u64 vruntime;
-
- if (first_fair(cfs_rq)) {
- vruntime = min_vruntime(cfs_rq->min_vruntime,
- __pick_next_entity(cfs_rq)->vruntime);
- } else
- vruntime = cfs_rq->min_vruntime;
+ u64 vruntime = cfs_rq->min_vruntime;

/*
* The 'current' period is already promised to the current tasks,
@@ -749,6 +749,7 @@ dequeue_entity(struct cfs_rq *cfs_rq, st
if (se != cfs_rq->curr)
__dequeue_entity(cfs_rq, se);
account_entity_dequeue(cfs_rq, se);
+ update_min_vruntime(cfs_rq);
}

/*
commit e17036dac189dd034c092a91df56aa740db7146d
Author: Peter Zijlstra <a.p.zijlstra@xxxxxxxxx>
Date: Thu Jan 15 14:53:39 2009 +0100

sched: fix update_min_vruntime

Impact: fix SCHED_IDLE latency problems

OK, so we have 1 running task A (which is obviously curr and the tree is
equally obviously empty).

'A' nicely chugs along, doing its thing, carrying min_vruntime along as it
goes.

Then some whacko speed freak SCHED_IDLE task gets inserted due to SMP
balancing, which is very likely far right, in that case

  update_curr
    update_min_vruntime
      cfs_rq->rb_leftmost := true (the crazy task sitting in a tree)
        vruntime = se->vruntime

and voila, min_vruntime is waaay right of where it ought to be.

OK, so why did I write it like that to begin with...

Aah, yes.

Say we've just dequeued current

  schedule
    deactivate_task(prev)
      dequeue_entity
        update_min_vruntime

Then we'll set

vruntime = cfs_rq->min_vruntime;

we find !cfs_rq->curr, but do find someone in the tree. Then we _must_
do vruntime = se->vruntime, because

vruntime = min_vruntime(vruntime := cfs_rq->min_vruntime, se->vruntime)

will not advance vruntime, and causes lag the other way around (which we
fixed with that earlier patch, 1af5f730fc1bf7c62ec9fb2d307206e18bf40a69
"sched: more accurate min_vruntime accounting").

Signed-off-by: Peter Zijlstra <a.p.zijlstra@xxxxxxxxx>
Tested-by: Mike Galbraith <efault@xxxxxx>
Acked-by: Mike Galbraith <efault@xxxxxx>
Cc: <stable@xxxxxxxxxx>
Signed-off-by: Ingo Molnar <mingo@xxxxxxx>
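
To see the failure in numbers, here is a toy userspace model of
update_min_vruntime() (not kernel code; the values are made up: curr at
vruntime 1000 == min_vruntime, and the far-right SCHED_IDLE task at
50000 as the only, hence leftmost, tree entry), comparing the old check
with the one this patch introduces:

#include <stdint.h>
#include <stdio.h>

typedef uint64_t u64;

static u64 min_vr(u64 a, u64 b) { return (int64_t)(a - b) < 0 ? a : b; }
static u64 max_vr(u64 a, u64 b) { return (int64_t)(a - b) > 0 ? a : b; }

/* a curr_vr/left_vr of 0 means "no current task" / "empty tree" here */
static u64 update(u64 min_vruntime, u64 curr_vr, u64 left_vr, int fixed)
{
	u64 vruntime = min_vruntime;

	if (curr_vr)
		vruntime = curr_vr;

	if (left_vr) {
		/* old check vs. the one introduced by this patch */
		int take_leftmost = fixed ? !curr_vr
					  : vruntime == min_vruntime;

		if (take_leftmost)
			vruntime = left_vr;
		else
			vruntime = min_vr(vruntime, left_vr);
	}

	return max_vr(min_vruntime, vruntime);
}

int main(void)
{
	printf("old: min_vruntime -> %llu\n",
	       (unsigned long long)update(1000, 1000, 50000, 0));
	printf("new: min_vruntime -> %llu\n",
	       (unsigned long long)update(1000, 1000, 50000, 1));
	return 0;
}

With the old check min_vruntime jumps to 50000, way right of where it
ought to be; with the fix it stays at 1000, while the dequeue-current
case (curr absent) still lets the leftmost entry advance it.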

---
kernel/sched_fair.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

Index: linux-2.6.27/kernel/sched_fair.c
===================================================================
--- linux-2.6.27.orig/kernel/sched_fair.c
+++ linux-2.6.27/kernel/sched_fair.c
@@ -233,7 +233,7 @@ static void update_min_vruntime(struct c
struct sched_entity,
run_node);

- if (vruntime == cfs_rq->min_vruntime)
+ if (!cfs_rq->curr)
vruntime = se->vruntime;
else
vruntime = min_vruntime(vruntime, se->vruntime);
commit 6bc912b71b6f33b041cfde93ca3f019cbaa852bc
Author: Peter Zijlstra <a.p.zijlstra@xxxxxxxxx>
Date: Thu Jan 15 14:53:38 2009 +0100

sched: SCHED_OTHER vs SCHED_IDLE isolation

Stronger SCHED_IDLE isolation:

- no SCHED_IDLE buddies
- never let SCHED_IDLE preempt on wakeup
- always preempt SCHED_IDLE on wakeup
- limit SLEEPER fairness for SCHED_IDLE.

Signed-off-by: Mike Galbraith <efault@xxxxxx>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@xxxxxxxxx>
Signed-off-by: Ingo Molnar <mingo@xxxxxxx>

---
kernel/sched_fair.c | 21 ++++++++++++++++-----
1 file changed, 16 insertions(+), 5 deletions(-)

Index: linux-2.6.27/kernel/sched_fair.c
===================================================================
--- linux-2.6.27.orig/kernel/sched_fair.c
+++ linux-2.6.27/kernel/sched_fair.c
@@ -689,9 +689,13 @@ place_entity(struct cfs_rq *cfs_rq, stru
unsigned long thresh = sysctl_sched_latency;

/*
- * convert the sleeper threshold into virtual time
+ * Convert the sleeper threshold into virtual time.
+ * SCHED_IDLE is a special sub-class. We care about
+ * fairness only relative to other SCHED_IDLE tasks,
+ * all of which have the same weight.
*/
- if (sched_feat(NORMALIZED_SLEEPER))
+ if (sched_feat(NORMALIZED_SLEEPER) &&
+ task_of(se)->policy != SCHED_IDLE)
thresh = calc_delta_fair(thresh, se);

vruntime -= thresh;
@@ -1347,15 +1351,22 @@ static void check_preempt_wakeup(struct
if (unlikely(se == pse))
return;

- cfs_rq_of(pse)->next = pse;
+ if (likely(task_of(se)->policy != SCHED_IDLE))
+ cfs_rq_of(pse)->next = pse;

/*
- * Batch tasks do not preempt (their preemption is driven by
+ * Batch and idle tasks do not preempt (their preemption is driven by
* the tick):
*/
- if (unlikely(p->policy == SCHED_BATCH))
+ if (unlikely(p->policy != SCHED_NORMAL))
return;

+ /* Idle tasks are by definition preempted by everybody. */
+ if (unlikely(curr->policy == SCHED_IDLE)) {
+ resched_task(curr);
+ return;
+ }
+
if (!sched_feat(WAKEUP_PREEMPT))
return;


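If someone wants a quick sanity check after applying the above, something
along these lines should do (hypothetical test sketch, assumes cpu 0 is
usable: pin a SCHED_IDLE hog and a periodic sleeper to the same cpu and
watch how late the sleeper wakes). With the fixes in place the delays
should stay small instead of stretching out:

#define _GNU_SOURCE		/* SCHED_IDLE, cpu_set_t, sched_setaffinity */
#include <sched.h>
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

static void pin_to_cpu0(void)
{
	cpu_set_t set;

	CPU_ZERO(&set);
	CPU_SET(0, &set);
	if (sched_setaffinity(0, sizeof(set), &set))
		perror("sched_setaffinity");
}

static double now_sec(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return ts.tv_sec + ts.tv_nsec / 1e9;
}

int main(void)
{
	struct sched_param sp;
	pid_t hog;
	int i;

	pin_to_cpu0();

	hog = fork();
	if (hog == 0) {
		/* child: SCHED_IDLE cpu hog, inherits the cpu 0 affinity */
		memset(&sp, 0, sizeof(sp));
		if (sched_setscheduler(0, SCHED_IDLE, &sp))
			perror("sched_setscheduler");
		for (;;)
			;
	}

	/* parent: plain SCHED_OTHER, sleep 10ms at a time, report lateness */
	for (i = 0; i < 100; i++) {
		double t = now_sec();

		usleep(10000);
		printf("wakeup delay: %6.1f ms\n",
		       (now_sec() - t) * 1e3 - 10.0);
	}

	kill(hog, SIGKILL);
	waitpid(hog, NULL, 0);
	return 0;
}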
