Re: [RFC patch] CFS fix place entity spread issue (v2)

From: Peter Zijlstra
Date: Mon Apr 19 2010 - 05:25:44 EST


On Sun, 2010-04-18 at 09:13 -0400, Mathieu Desnoyers wrote:

> Detailed explanation in my ELC2010 presentation at:
>
> http://www.efficios.com/elc2010

So what happened to comprehensive changelogs?

> I've torn the CFS scheduler apart in the past days to figure out what is causing
> this weird behavior, and the culprit seems to be place_entity(). The problem
> appears to be the cumulative effect of letting the min_vruntime go backward when
> putting sleepers back on the runqueue.

Right, that should not happen, ever, min_vruntime should be monotonic.
Your explanation fails to mention how exactly this happens.

> It lets the vruntime spread grow to
> "entertaining" values (it is supposed to be in the 5ms range, not 18 minutes!).
>
> In the original code, a max between the sched entity vruntime and the calculated
> vruntime was supposed to "ensure that the thread time never go backward". But I
> don't see why we even care about that.

Fairness, a task's vruntime going backwards means it can obtain more
cputime than a while(1) loop would get, not a good thing.

> The key point is that the min_vruntime
> of the runqueue should not go backward.

That is certainly important too.

> I propose to fix this by calculating the relative offset from
> min_vruntime + sysctl_sched_latency rather than directly from min_vruntime. I
> also ensure that the value never goes below min_vruntime.

I'm not sure that's a good idea, that'll end up being a 'fairer'
scheduler in the strict sense, but I suspect it will break the wakeup
preemption model we have.

> Under the Xorg workload, moving a few windows around and starting firefox while
> executing the wakeup-latency.c program (program waking up every 10ms and
> reporting wakeup latency), this patch brings worse latency from 60ms down to
> 12ms. Even doing a kernel compilation at the same time, the worse latency stays
> around 20ms now.

That seems nice enough. But really, this changelog doesn't explain the
cause at all.

/me goes try and figure out where this min_vruntime goes backwards.

So update_min_vruntime() is the sole place where we change min_vruntime,
it is called from update_curr() where we add runtime to the task we're
currently servicing, and dequeue_entity() where we remove someone from
the tree.

These two sites are the only places where min_vruntime can change
because only adding service to the leftmost task, or removal thereof can
make the min_vruntime go forward.

So update_min_vruntime():

- takes the smaller vruntime between the current task and the leftmost
tree entry (current is not in the tree, and we can pass beyond the
leftmost waiting task due to limited preemption and minimum service
windows).

- set min_vruntime to the larger between the existing min_vruntime and
the smallest vruntime.

So colour me confused if I'm not seeing min_vruntime going backwards...



--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/