Re: [PATCH v2] sched/fair: update scale invariance of PELT

From: Peter Zijlstra
Date: Tue Apr 11 2017 - 06:42:01 EST


On Tue, Apr 11, 2017 at 11:40:21AM +0200, Vincent Guittot wrote:
> Le Tuesday 11 Apr 2017 à 10:53:05 (+0200), Peter Zijlstra a écrit :
> > On Tue, Apr 11, 2017 at 09:52:21AM +0200, Vincent Guittot wrote:
> > > Le Monday 10 Apr 2017 à 19:38:02 (+0200), Peter Zijlstra a écrit :
> > > >
> > > > Thanks for the rebase.
> > > >
> > > > On Mon, Apr 10, 2017 at 11:18:29AM +0200, Vincent Guittot wrote:
> > > >
> > > > Ok, so let me try and paraphrase what this patch does.
> > > >
> > > > So consider a task that runs 16 out of our 32ms window:
> > > >
> > > > running idle
> > > > |---------|---------|

(A)

> > > >
> > > >
> > > > You're saying that when we scale running with the frequency, suppose we
> > > > were at 50% freq, we'll end up with:
> > > >
> > > > run idle
> > > > |----|---------|

(B)

> > > >
> > > >
> > > > Which is obviously a shorter total then before; so what you do is add
> > > > back the lost idle time like:
> > > >
> > > > run lost idle
> > > > |----|----|---------|

(C)

> > > >
> > > >
> > > > to arrive at the same total time. Which seems to make sense.
> > >
> > > Yes
> >
> > OK, bear with me.
> >
> >
> > So we have:
> >
> >
> > util_sum' = utilsum * y^p +
> >
> > p-1
> > d1 * y^p + 1024 * \Sum y^n + d3 * y^0
> > n=1
> >
> > For the unscaled version, right?
>
> Yes for the running state.
>
> For sleeping state, it's just util_sum' = utilsum * y^p

Sure, and from this is follows that for idle time we add 0, while we do
decay. Lets call this (1).

> >
> >
> >
> > Now for the scaled version, instead of adding a full 'd1,d2,d3' running
> > segments, we want to add partially running segments, where r=f*d/f_max,
> > and lost segments l=d-r to fill out the idle time.
> >
> > But afaict we then end up with (F=f/f_max):
> >
> >
> > util_sum' = utilsum * y^p +
> >
> > p-1
> > F * d1 * y^p + F * 1024 * \Sum y^n + F * d3 * y^0
> > n=1
>
> you also have a longer running time as it runs slower. We make the assumption that
> d1+d2+d3 = (d1'+d2'+d3') * F

No, d's stay the same length, r's are the scaled d, and l's the
remainder, or lost idle time.

That is; r + l = d, that way the time scale stays invariant as above (A)
& (C).

So if we run slower, we scale back r and l becomes !0.

> If we consider that we cross a decay window, we still have the d1 to
> complete the past one but then p'*F= p and d'3 will be the remaining
> part of the current window and most probably not equal to d3

So by doing r=Fd we get some (lost) idle time for every bit of runtime,
equally distributed, as if the CPU inserted NOP cycles to lower the
effective frequency.

You want to explicitly place the idle time at the end? That would bias
the sum downwards. To what point?

> so we have with current invariance:
>
> util_sum' = utilsum * y^(p/F) +
> (p/F - 1)
> F * d1 * y^(p/F) + F * 1024 * \Sum y^n + F * d'3 * y^0
> n=1

No, we don't have p/F. p is very much _NOT_ scaled.

Look at accumulate_sum(), we compute p from d, not r.

> >
> > And we can collect the common term F:
> >
> > util_sum' = utilsum * y^p +
> >
> > p-1
> > F * (d1 * y^p + 1024 * \Sum y^n + d3 * y^0)
> > n=1
> >
> >
> > Which is exactly what we already did.
>
> In the new invariance scale, the F is applied on p not on the contribution
> value

That's wrong... That would result in (B) not (C).