Re: [RFC patch] CFS fix place entity spread issue (v2)

From: Peter Zijlstra
Date: Mon Apr 19 2010 - 10:43:56 EST


On Sun, 2010-04-18 at 09:13 -0400, Mathieu Desnoyers wrote:

OK, so looking purely at the patch:

> Index: linux-2.6-lttng.git/kernel/sched_fair.c
> ===================================================================
> --- linux-2.6-lttng.git.orig/kernel/sched_fair.c 2010-04-18 01:48:04.000000000 -0400
> +++ linux-2.6-lttng.git/kernel/sched_fair.c 2010-04-18 08:58:30.000000000 -0400
> @@ -738,6 +738,14 @@
> unsigned long thresh = sysctl_sched_latency;
>
> /*
> + * Place the woken up task relative to
> + * min_vruntime + sysctl_sched_latency.
> + * We must _never_ decrement min_vruntime, because the effect is

Nobody I could find decrements min_vruntime, and certainly
place_entity() doesn't change min_vruntime. So this is a totally
misguided comment.

> + * that spread increases progressively under the Xorg workload.
> + */
> + vruntime += sysctl_sched_latency;

So in effect you change:
vruntime = max(vruntime, min_vruntime - thresh/2)
into
vruntime = max(vruntime, min_vruntime + thresh/2)

in a non-obvious way and for an unclear reason.
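
To make the before/after concrete, here is a minimal userspace sketch of the
two placements. It is a simplified model, not the kernel code: the names
place_before(), place_after() and max_vrt() are made up for illustration, it
assumes the sleeper threshold is halved (the thresh >>= 1 path), it skips the
conversion of the threshold into virtual time, and it ignores the extra clamp
added at the end of the patch.

```c
#include <stdint.h>

/* Hypothetical helper: wrap-safe maximum of two virtual runtimes.
 * Relies on two's-complement conversion to int64_t. */
static uint64_t max_vrt(uint64_t a, uint64_t b)
{
	return (int64_t)(b - a) > 0 ? b : a;
}

/* Model of the unpatched wakeup placement: min_vruntime - latency/2. */
static uint64_t place_before(uint64_t se_vruntime, uint64_t min_vruntime,
			     uint64_t latency)
{
	uint64_t v = min_vruntime;

	v -= latency >> 1;	/* halved sleeper credit */
	return max_vrt(se_vruntime, v);
}

/* Model of the patched wakeup placement: min_vruntime + latency/2. */
static uint64_t place_after(uint64_t se_vruntime, uint64_t min_vruntime,
			    uint64_t latency)
{
	uint64_t v = min_vruntime;

	v += latency;		/* the line the patch adds */
	v -= latency >> 1;	/* halved sleeper credit, unchanged */
	return max_vrt(se_vruntime, v);
}
```

For a long sleeper, the two results differ by exactly one latency period: the
woken task lands to the right of min_vruntime instead of to the left of it.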

> + /*
> * Convert the sleeper threshold into virtual time.
> * SCHED_IDLE is a special sub-class. We care about
> * fairness only relative to other SCHED_IDLE tasks,
> @@ -755,6 +763,9 @@
> thresh >>= 1;
>
> vruntime -= thresh;
> +
> + /* ensure min_vruntime never go backwards. */
> + vruntime = max_t(u64, vruntime, cfs_rq->min_vruntime);

So the comment doesn't match the code, nor is it correct.

The code tries to implement clipping vruntime to min_vruntime, not
clipping min_vruntime, but then botches it by not taking wrap-around
into account.
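
As an illustration of the wrap-around problem, here is a small userspace
sketch. naive_max() models what a plain unsigned max does; wrap_safe_max() is
modeled on the signed-delta comparison that max_vruntime() in sched_fair.c
uses. Both function names are made up for this example.

```c
#include <stdint.h>

/* Plain unsigned comparison: breaks once one value has wrapped past 0. */
static uint64_t naive_max(uint64_t a, uint64_t b)
{
	return a > b ? a : b;
}

/* Compare via the signed difference, so that a value slightly "ahead"
 * still wins after the u64 counter wraps. Relies on two's-complement
 * conversion to int64_t. */
static uint64_t wrap_safe_max(uint64_t a, uint64_t b)
{
	int64_t delta = (int64_t)(b - a);

	return delta > 0 ? b : a;
}
```

With min_vruntime = UINT64_MAX - 10 and a vruntime 100 units later (which
wraps to 89), naive_max() picks the stale min_vruntime, while wrap_safe_max()
picks 89, the value that is actually later on the timeline.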

Now, I know why your patch helps you (it's in effect similar to what
START_DEBIT does for fork()), but getting the wakeup-preemption to do
something nice along with it is the hard part.
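
A rough sketch of why the placement matters for wakeup preemption, assuming a
simplified check in which the woken task preempts only if the current task's
vruntime leads it by more than some granularity. should_preempt() and the gran
parameter are hypothetical stand-ins, not the kernel's actual interface.

```c
#include <stdint.h>

/* Hypothetical simplified wakeup-preemption test: returns 1 if the woken
 * task is far enough behind curr on the virtual timeline to preempt it. */
static int should_preempt(uint64_t curr_vruntime, uint64_t woken_vruntime,
			  uint64_t gran)
{
	int64_t vdiff = (int64_t)(curr_vruntime - woken_vruntime);

	return vdiff > (int64_t)gran;
}
```

Under these assumptions, a sleeper placed at min_vruntime - thresh/2 can
preempt a task running close to min_vruntime, whereas one placed at
min_vruntime + thresh/2 cannot, so the wakeup loses its preemption.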

The whole perfectly fair scheduling thing is more-or-less doable
(dealing with tasks dying with !0-lag gets interesting, you'd have to
start adjusting global-timeline like things for that). But the thing is
that it makes for rather poor interactive behaviour.

Letting a task that sleeps long and runs short preempt heavier tasks
generally works well. Also, there are a number of apps that get a nice
boost from getting preempted before they can actually block on a
(read-like) system call. That saves a whole scheduler round-trip on the
wakeup side, so ping-pong like tasks love this too.

And then there is the whole signal delivery muck..
