Re: [RESEND PATCH 2/2] sched/fair: Optimize __update_sched_avg()

From: Yuyang Du
Date: Wed Mar 29 2017 - 04:11:25 EST


Hi Peter,

On Tue, Mar 28, 2017 at 04:46:25PM +0200, Peter Zijlstra wrote:
> On Mon, Feb 13, 2017 at 05:44:23AM +0800, Yuyang Du wrote:
> > __update_load_avg() has the following steps:
> >
> > 1. add the remainder of the last incomplete period
> > 2. decay old sum
> > 3. accumulate new sum in full periods since last_update_time
> > 4. accumulate the current incomplete period
> > 5. update averages
> >
> > However, there is no need to separately compute steps 1, 3, and 4.
> >
> > Illustation:
> >
> > c1 c3 c4
> > ^ ^ ^
> > | | |
> > |<->|<----------------->|<--->|
> > ... |---x---|------| ... |------|-----x (now)
> >
> > c1, c3, and c4 are the accumulated (meanwhile decayed) contributions
> > in timing in steps 1, 3, and 4 respectively.
> >
> > With them, the accumulated contribution to load_sum, for example, is:
> >
> > contrib = c1 * weight * freq_scaled;
> > contrib += c3 * weight * freq_scaled;
> > contrib += c4 * weight * freq_scaled;
> >
> > Obviously, we can optimize the above and they equate to:
> >
> > contrib = c1 + c3 + c4;
> > contrib *= weight * freq_scaled;
> >
>
> So I figured out what it is you're doing, how's this? I still need to
> rewrite the Changelog to make this cleared,

Yes, you need to, and let me do it too and learn how you will rewrite
it.

Subject: [PATCH] sched/fair: Optimize __update_sched_avg()

In __update_load_avg(), the flow of periods and update times are
illustrated as:

t1 t3 t4
^ ^ ^
| | |
|<->|<----------------->|<--->|
... |---x---|------| ... |------|-----x (now)

in which, t1 is the remainder of the last incomplete period, t2
is the full periods since the last_update_time, and t3 is the
current period.

These times altogether make contributions to the load_sum with
the following 5 steps:

step 1: t1 is decayed as c1
step 2: last sum is decayed
step 3: t3 has accumulated c3
step 4: t4 is c4
step 5: average the sum

These contributions are further scaled with weight and frequency:

contrib = c1 * weight * freq_scaled;
contrib += c3 * weight * freq_scaled;
contrib += c4 * weight * freq_scaled;

Obviously, they equate to:

contrib = c1 + c3 + c4;
contrib *= weight * freq_scaled;

By doing so, we save instructions, and clean the codes. With that, c1
must be additionally decayed as opposed to doing it in step 2, but
these two approaches are about the same if you compare decay_load()
with a round of contrib scaling.

Code size comparison (with allyesconfig):

Before: kernel/sched/built-in.o 1213304
After : kernel/sched/built-in.o 1212536 (-768B)

> but I think the code now has
> understandable comments.

Yes, you made it. It's much better now.

Thanks,
Yuyang