Re: [patch v8 3/9] sched: set initial value of runnable avg for new forked task

From: Peter Zijlstra
Date: Mon Jun 17 2013 - 05:21:40 EST


On Fri, Jun 14, 2013 at 06:02:45PM +0800, Lei Wen wrote:
> Hi Alex,
>
> On Fri, Jun 7, 2013 at 3:20 PM, Alex Shi <alex.shi@xxxxxxxxx> wrote:
> > We need to initialize se.avg.{decay_count, load_avg_contrib} for a
> > newly forked task.
> > Otherwise random values in those variables make a mess when the new
> > task is enqueued:
> >   enqueue_task_fair
> >     enqueue_entity
> >       enqueue_entity_load_avg
> >
> > and make fork balancing go wrong because of the incorrect
> > load_avg_contrib.
> >
> > Furthermore, Morten Rasmussen noticed that some tasks were not launched
> > immediately after being created. So Paul and Peter suggested giving the
> > new task's runnable avg a start value equal to one sched_slice().
>
> I am confused by this comment: how would setting the runnable avg to a
> slice change the behavior of "some tasks were not launched immediately
> after being created"?
>
> IMHO, as far as I can tell a newly forked task only gets to run once the
> current task has been marked need_resched and preempt_schedule() or
> preempt_schedule_irq() is called.
>
> Setting the avg to a slice does not affect this task's vruntime, and
> hence cannot make the currently running task need_resched if it was not
> already.


So the 'problem' is that our running avg is a 'floating' average; i.e. it
decays with time. Now we have to guess about the future of our newly
spawned task -- something that is nigh impossible seeing as these CPU
vendors keep refusing to implement the crystal ball instruction.
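
To make the 'floating' part concrete, here is a toy user-space model -- not
the kernel code; the only real constant in it is that the per-entity load
tracking picks y such that y^32 == 1/2, so history halves every 32ms, the
rest is made up for illustration -- of how a task's contribution decays
once it stops being runnable:

#include <math.h>
#include <stdio.h>

int main(void)
{
	/* y is chosen so that y^32 == 1/2: history halves every 32ms. */
	const double y = pow(0.5, 1.0 / 32.0);	/* ~0.97857 per 1ms period */
	double contrib = 1024.0;		/* a fully busy nice-0 task */

	for (int ms = 0; ms <= 64; ms += 16)
		printf("after %2dms of not running: contrib ~ %4.0f\n",
		       ms, contrib * pow(y, ms));
	return 0;
}

Whatever we seed a new task with, a few tens of milliseconds of reality are
enough to override the guess.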

So there are two asymptotic cases we want to deal well with: 1) the case
where the newly spawned program will be 'nearly' idle for its lifetime;
and 2) the case where it's cpu-bound.

Since we have to guess, we'll go for the worst case and assume it's
cpu-bound; but then we don't want to make the avg so heavy that adjusting
to the near-idle case takes forever. We want to be able to quickly adjust
and lower our running avg.

Now we also don't want to make our avg too light, such that it gets
decremented just because the new task hasn't had a chance to run yet --
even if, when it does get to run, it turns out to be more cpu-bound than
not.

So what we do is give the initial avg the same duration as the time we
guess it takes to run each task on the system at least once -- aka
sched_slice().
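
In code that comes down to something like the sketch below -- a sketch
against the fair.c of that era, reconstructed from memory rather than
quoted from the patch; the helper name and the runnable_avg_{sum,period}
fields are assumptions on my part. The idea is to seed both the sum and
the period with one sched_slice() worth of 1024us units, so the new task
starts out looking 100% busy over exactly one slice, and then derive
load_avg_contrib from that:

/*
 * Sketch only: apart from decay_count, load_avg_contrib and
 * sched_slice(), the names here are reconstructed from memory,
 * not quoted from the patch.
 */
void init_task_runnable_average(struct task_struct *p)
{
	u32 slice;

	p->se.avg.decay_count = 0;		/* no decay history yet */

	/* One sched_slice() worth of time, in 1024us (~1ms) periods. */
	slice = sched_slice(task_cfs_rq(p), &p->se) >> 10;

	/*
	 * Pretend the task was runnable for that entire window: the
	 * cpu-bound guess, but with only one slice of history behind
	 * it so it can decay quickly if the guess turns out wrong.
	 */
	p->se.avg.runnable_avg_sum = slice;
	p->se.avg.runnable_avg_period = slice;

	/* Compute load_avg_contrib from the seeded sum/period. */
	__update_task_entity_contrib(&p->se);
}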

Of course we can defeat this with wakeup/fork bombs, but in the 'normal'
case it should be good enough.


Does that make sense?