Re: sched: fix/optimise some issues

From: Peter Zijlstra
Date: Thu Jul 21 2011 - 12:32:53 EST

On Thu, 2011-07-21 at 18:36 +0200, Stephan Bärwolf wrote:
> > Right, so I've often wanted a [us]128 type, and gcc has some (broken?)
> > support for that, but overhead has always kept me from it.
> 128bit sched_vruntime_t support seems to be running fine when compiled with
> gcc (Gentoo 4.4.5 p1.2, pie-0.4.5) 4.4.5.
> Of course overhead is a problem (but there is also overhead using u64 on x86),

Yeah, I know, but luckily all 32bit computing shall die sooner rather
than later. But there really wasn't much choice there anyway, 32bit
simply won't do.
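
For illustration only, a minimal user-space sketch of what a 128-bit
intermediate buys: a 64-bit delta scaled by a ratio without the product
overflowing. It assumes a GCC-compatible compiler on a 64-bit target, where
`unsigned __int128` exists and `__SIZEOF_INT128__` is defined. The name
`scale_delta` is hypothetical and not taken from the patch under discussion.

```c
#include <stdint.h>

/*
 * Hypothetical helper: compute delta * mul / div.
 * With unsigned __int128 the product cannot overflow; the fallback
 * branch uses plain 64-bit math and can overflow for large inputs.
 * The result is truncated to 64 bits in both cases.
 */
static uint64_t scale_delta(uint64_t delta, uint64_t mul, uint64_t div)
{
#ifdef __SIZEOF_INT128__
	unsigned __int128 wide = (unsigned __int128)delta * mul;

	return (uint64_t)(wide / div);
#else
	return delta * mul / div;	/* may overflow: no 128-bit type here */
#endif
}
```

On a 32-bit build the wide branch is normally unavailable, and on 64-bit the
128-bit divide is typically a library call rather than a single instruction,
which is the overhead being argued about.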

> that is why it should be Kconfig selectable (for servers with many processes,
> deep cgroups and many different priorities?).

Sadly that's not how things work in practice: distros will have to
enable the option, and that means that pretty much everybody runs it. The
whole cgroup crap is already _way_ too expensive.

> But I think abstracting the whole vruntime stuff into a separate collection
> also simplifies further evaluation and adaptation. (Think of central
> statistics collection, for example the maximum timeslice seen or the
> overflows that occurred, without changing all the lines of code at the risk
> of missing something.)

It made rather a mess of things.

> > There's also the non-atomicity thing to consider, see min_vruntime_copy
> > etc.
> I think atomicity is not a (great) issue, for two reasons:
> a) on x86 the u64 wouldn't be atomic either (vruntime is u64, not
> atomic64_t)

atomic64_t isn't needed in order to guarantee consistent loads; Linux
depends on the fact that all naturally aligned loads are complete loads
(no partials etc.).
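
As a rough illustration of that rule, here is a sketch of the condition under
which a load can be a single complete one. It assumes, as Linux does, that
`long` is the machine word, and the helper name `single_load_ok` is
hypothetical. A u64 on a 32-bit build fails the size test, which is why
min_vruntime needs extra care there.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: true if an object of 'size' bytes at 'p' is
 * a power-of-two size, no wider than a machine word (assumed to be
 * sizeof(long)), and naturally aligned, so one load can fetch it whole.
 */
static bool single_load_ok(const void *p, size_t size)
{
	bool pow2 = size != 0 && (size & (size - 1)) == 0;

	return pow2 && size <= sizeof(long) && ((uintptr_t)p % size) == 0;
}
```

With sizeof(long) == 4, `single_load_ok(&x, sizeof(uint64_t))` is false for
any address, since the size check alone rejects it.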

> b) every operation on cfs_rq->min_vruntime should happen while
> holding the runqueue lock?

commit 3fe1698b7fe05aeb063564e71e40d09f28d8e80c
Author: Peter Zijlstra <a.p.zijlstra@xxxxxxxxx>
Date: Tue Apr 5 17:23:48 2011 +0200

sched: Deal with non-atomic min_vruntime reads on 32bits

In order to avoid reading partial updated min_vruntime values on 32bit
implement a seqcount like solution.

Reviewed-by: Frank Rowand <frank.rowand@xxxxxxxxxxx>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@xxxxxxxxx>
Cc: Mike Galbraith <efault@xxxxxx>
Cc: Nick Piggin <npiggin@xxxxxxxxx>
Cc: Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx>
Cc: Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx>
Signed-off-by: Ingo Molnar <mingo@xxxxxxx>

diff --git a/kernel/sched.c b/kernel/sched.c
index 46f42ca..7a5eb26 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -312,6 +312,9 @@ struct cfs_rq {

 	u64 exec_clock;
 	u64 min_vruntime;
+#ifndef CONFIG_64BIT
+	u64 min_vruntime_copy;
+#endif

 	struct rb_root tasks_timeline;
 	struct rb_node *rb_leftmost;
diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index ad4c414f..054cebb 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -358,6 +358,10 @@ static void update_min_vruntime(struct cfs_rq *cfs_rq)

 	cfs_rq->min_vruntime = max_vruntime(cfs_rq->min_vruntime, vruntime);
+#ifndef CONFIG_64BIT
+	smp_wmb();
+	cfs_rq->min_vruntime_copy = cfs_rq->min_vruntime;
+#endif

@@ -1376,10 +1380,21 @@ static void task_waking_fair(struct task_struct *p)
 	struct sched_entity *se = &p->se;
 	struct cfs_rq *cfs_rq = cfs_rq_of(se);
+	u64 min_vruntime;

-	lockdep_assert_held(&task_rq(p)->lock);
+#ifndef CONFIG_64BIT
+	u64 min_vruntime_copy;

-	se->vruntime -= cfs_rq->min_vruntime;
+	do {
+		min_vruntime_copy = cfs_rq->min_vruntime_copy;
+		smp_rmb();
+		min_vruntime = cfs_rq->min_vruntime;
+	} while (min_vruntime != min_vruntime_copy);
+#else
+	min_vruntime = cfs_rq->min_vruntime;
+#endif
+
+	se->vruntime -= min_vruntime;
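
The value/copy protocol in the commit above can be sketched in user space
roughly as follows. This is an approximation under stated assumptions, not
the kernel code: C11 fences stand in for smp_wmb()/smp_rmb(), each 64-bit
field is modelled as two 32-bit halves to mimic a 32-bit machine, and the
names `struct vrt`, `vrt_write` and `vrt_read` are hypothetical. Like the
original, it relies on the value only moving forward, so a torn read is very
unlikely to match the copy by accident.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical stand-in for min_vruntime and min_vruntime_copy. */
struct vrt {
	_Atomic uint32_t lo, hi;		/* the value, as two halves */
	_Atomic uint32_t copy_lo, copy_hi;	/* the copy, as two halves */
};

/* Writer: store the value, then a barrier, then the copy. */
static void vrt_write(struct vrt *v, uint64_t val)
{
	atomic_store_explicit(&v->lo, (uint32_t)val, memory_order_relaxed);
	atomic_store_explicit(&v->hi, (uint32_t)(val >> 32), memory_order_relaxed);
	atomic_thread_fence(memory_order_release);	/* role of smp_wmb() */
	atomic_store_explicit(&v->copy_lo, (uint32_t)val, memory_order_relaxed);
	atomic_store_explicit(&v->copy_hi, (uint32_t)(val >> 32), memory_order_relaxed);
}

/* Reader: load the copy, then a barrier, then the value; retry on mismatch. */
static uint64_t vrt_read(struct vrt *v)
{
	uint64_t val, copy;

	do {
		copy = atomic_load_explicit(&v->copy_lo, memory_order_relaxed);
		copy |= (uint64_t)atomic_load_explicit(&v->copy_hi, memory_order_relaxed) << 32;
		atomic_thread_fence(memory_order_acquire);	/* role of smp_rmb() */
		val = atomic_load_explicit(&v->lo, memory_order_relaxed);
		val |= (uint64_t)atomic_load_explicit(&v->hi, memory_order_relaxed) << 32;
	} while (val != copy);

	return val;
}
```

The reader takes no lock, matching the removal of the lockdep assertion in
task_waking_fair(): it simply retries until the value and its copy agree.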

