Re: sched: fix/optimise some issues

From: Stephan Bärwolf
Date: Thu Jul 21 2011 - 12:21:31 EST

Thank you for your fast response and your detailed comments.

On 07/21/11 17:08, Peter Zijlstra wrote:
> On Wed, 2011-07-20 at 15:42 +0200, Stephan Bärwolf wrote:
>> I also implemented an 128bit vruntime support:
>> Majorly on systems with many tasks and (for example) deep cgroups
>> (or increased NICE0_LOAD/ SCHED_LOAD_SCALE as in commit
>> c8b281161dfa4bb5d5be63fb036ce19347b88c63), a weighted timeslice
>> (unsigned long) can become very large (on x86_64) and consumes a
>> large part of the u64 vruntimes (per tick) when added.
>> This might lead to missscheduling because of overflows.
> Right, so I've often wanted a [us]128 type, and gcc has some (broken?)
> support for that, but overhead has always kept me from it.
The 128-bit sched_vruntime_t support seems to run fine when compiled with
gcc (Gentoo 4.4.5 p1.2, pie-0.4.5) 4.4.5.
Of course overhead is a problem (but there is also overhead using u64 on
that is why it should be Kconfig selectable (for servers with many
deep cgroups and many different priorities?).

But I also think that abstracting the whole vruntime stuff into a separate
simplifies further evaluations and adaptations. (Think of central
statistics collection,
for example the maximum timeslice seen or the overflows that happened,
without changing
all the lines of code with the risk of missing something.)
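
As a rough userspace sketch of that abstraction idea (not the actual
patch): only sched_vruntime_t is a name from this thread, while
struct vruntime_stats and vruntime_add() are hypothetical helpers made
up for illustration. The width is chosen in one place, and every
addition passes through one function that can keep statistics.

```c
#include <stdint.h>

/* Select the width in one place; a Kconfig option could drive this. */
#if defined(__SIZEOF_INT128__)
typedef __uint128_t sched_vruntime_t;   /* gcc/clang on 64-bit targets */
#else
typedef uint64_t sched_vruntime_t;
#endif

/* Hypothetical central statistics, updated by every vruntime addition. */
struct vruntime_stats {
    uint64_t max_delta;   /* largest weighted timeslice seen so far */
    uint64_t overflows;   /* number of additions that wrapped around */
};

/* Add a weighted timeslice to a vruntime and record statistics. */
static sched_vruntime_t vruntime_add(sched_vruntime_t vruntime,
                                     uint64_t delta,
                                     struct vruntime_stats *stats)
{
    sched_vruntime_t sum = vruntime + delta;

    if (delta > stats->max_delta)
        stats->max_delta = delta;
    if (sum < vruntime)   /* unsigned wraparound happened */
        stats->overflows++;
    return sum;
}
```

With such a helper, switching between 64 and 128 bits or adding another
counter touches one typedef and one function instead of every call site.
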
> There's also the non-atomicy thing to consider, see min_vruntime_copy
> etc.
I think atomicity is not a (great) issue, for two reasons:
a) on x86 the u64 wouldn't be atomic either (vruntime is u64 not
b) every operation on cfs_rq->min_vruntime should happen while
holding the runqueue lock?
> How horrid is the current vruntime situation?
This is a point that needs further discussion/observation.

When, for example, NICE0_LOAD is increased by 6 bits (and I think
"c8b281161dfa4bb5d5be63fb036ce19347b88c63" did it by 10 bits
on x86_64), the maximum timeslice (I am not quite sure if it was on
HZ=1000) with a PRE kernel will be around 2**38.
Adding this every ms (let's say 1024 times per sec) to min_vruntime
might cause overflows too fast (after 2**(63-38-10)sec = 2**15sec ~ 9h).
Having a great heterogeneity of priorities may intensify this situation...
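
The estimate above can be checked with a few lines of C. This is only
the arithmetic from this mail, not kernel code; the function name and
its parameters are made up for illustration.

```c
#include <stdint.h>

/*
 * Seconds until a counter with 2**headroom_bits of usable range is
 * exhausted when 2**delta_bits is added 2**rate_bits times per second.
 * All three parameters are illustrative, not kernel identifiers.
 */
static uint64_t seconds_until_overflow(unsigned headroom_bits,
                                       unsigned delta_bits,
                                       unsigned rate_bits)
{
    unsigned consumed_bits = delta_bits + rate_bits;

    if (consumed_bits >= headroom_bits)
        return 0;            /* exhausted within the first second */
    if (headroom_bits - consumed_bits >= 64)
        return UINT64_MAX;   /* result would not fit into 64 bits */
    return UINT64_C(1) << (headroom_bits - consumed_bits);
}
```

Calling seconds_until_overflow(63, 38, 10) gives 2**15 = 32768 seconds,
which is a little over 9 hours.
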

Long story short: on x86_64 an unsigned long (timeslice) could be
as large as the whole u64 for min_vruntime and this is dangerous.

Of course limiting the maximum timeslice in "calc_delta_mine()" would
help, too - but without the comfort of using the whole x86_64 capabilities
(and, mostly because of that, finer priority resolutions).
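
To illustrate only the limiting step (this is not how calc_delta_mine()
is written): a minimal sketch that clamps an already computed weighted
timeslice. Both MAX_WEIGHTED_DELTA and clamp_weighted_delta() are
hypothetical, and the cap of 2**32 is an arbitrary value for the example.

```c
#include <stdint.h>

/* Hypothetical cap: no single weighted timeslice may exceed 2**32. */
#define MAX_WEIGHTED_DELTA (UINT64_C(1) << 32)

/* Return the weighted timeslice, limited to MAX_WEIGHTED_DELTA. */
static uint64_t clamp_weighted_delta(uint64_t delta)
{
    return delta > MAX_WEIGHTED_DELTA ? MAX_WEIGHTED_DELTA : delta;
}
```

With this cap and 1024 additions per second, the same estimate as before
gives 2**(63-32-10)sec = 2**21sec, roughly 24 days, at the cost of
flattening every timeslice above the cap to the same value.
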
> As to your true-idle, there's a very good reason the current SCHED_IDLE
> isnt' a true idle scheduler; it would create horrid priority inversion
> problems, imagine the true idle task holding a mutex or is required to
> complete something.
Of course, I fully agree! This is one reason why it was marked as
"experimental". When having a few background jobs (for example
a boinc or a bitcoin cruncher ;-) ) it works ok because there do not
seem to be many locks shared across processes.
But in general it is a bad idea...

I also vaguely remember that Linus had something against "priority
inheritance" (don't ask me what or why - I don't know),
but it would be an honour for me to work with you guys on implementing
this feature in future kernels. (On the basis of rb-trees saving the
priorities of each "se" holding the lock, to solve prio. inv.? Or in
non-schedulable contexts maybe setting a "super-priority" while locking)

I think real idle scheduling (maybe based on more than one idle level)
would be a great feature for future kernels.
(For example utilizing expensive systems without noticeable effects on
Even because SMP gains more and
more importance (plus increasing CPUs/cores) and the "load-balancing"
often leads to short
but significant idle phases on sparse (because of interactivity) processed


regards Stephan

Dipl.-Inf. Stephan Bärwolf
Ilmenau University of Technology, Integrated Communication Systems Group
Phone: +49 (0)3677 69 4130
Email: stephan.baerwolf@xxxxxxxxxxxxx
