Re: [PATCH 2/2] sched/fair: Always propagate runnable_load_avg

From: Tejun Heo
Date: Fri Apr 28 2017 - 12:15:32 EST


Hello, Vincent.

On Thu, Apr 27, 2017 at 10:28:01AM +0200, Vincent Guittot wrote:
> On 27 April 2017 at 02:30, Tejun Heo <tj@xxxxxxxxxx> wrote:
> > Hello, Vincent.
> >
> > On Wed, Apr 26, 2017 at 12:21:52PM +0200, Vincent Guittot wrote:
> >> > This is from the follow-up patch. I was confused. Because we don't
> >> > propagate decays, we still should decay the runnable_load_avg;
> >> > otherwise, we end up accumulating errors in the counter. I'll drop
> >> > the last patch.
> >>
> >> Ok, the runnable_load_avg goes back to 0 when I drop patch 3. But I
> >> see runnable_load_avg sometimes significantly higher than load_avg,
> >> which is normally not possible since load_avg = runnable_load_avg +
> >> sleeping tasks' load_avg.
> >
> > So, while load_avg would eventually converge on runnable_load_avg +
> > blocked load_avg given a stable enough workload for long enough,
> > runnable_load_avg jumping above load_avg temporarily is expected,
>
> No, it's not. Look at load_avg/runnable_load_avg at the root domain
> when only tasks are involved: runnable_load_avg will never be higher
> than load_avg because
> load_avg = sum of load_avg of tasks attached to the cfs_rq
> runnable_load_avg = sum of load_avg of tasks attached and enqueued
> to the cfs_rq
> load_avg = runnable_load_avg + blocked tasks' load_avg, and as a
> result runnable_load_avg can never be higher than load_avg.

We don't actually calculate it that way, but yeah, given the
calculations we do, runnable_load_avg shouldn't be significantly
higher than load_avg. I often see runnable_load_avg going above
load_avg by a bit, but the margins are small. I wonder whether you're
seeing errors accumulating from the incorrect min() capping.
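
As a toy illustration of the two sums (standalone C, not kernel code;
the struct, the load_avg values and the enqueued flags are all made up
for this example):

#include <stdio.h>
#include <stdbool.h>

struct task { long load_avg; bool enqueued; };

int main(void)
{
	/* hypothetical tasks attached to a single cfs_rq */
	struct task tasks[] = {
		{ .load_avg = 512, .enqueued = true  },
		{ .load_avg = 256, .enqueued = false },	/* blocked/sleeping */
		{ .load_avg = 128, .enqueued = true  },
	};
	long load_avg = 0, runnable_load_avg = 0;

	for (int i = 0; i < 3; i++) {
		/* load_avg sums every task attached to the cfs_rq */
		load_avg += tasks[i].load_avg;
		/* runnable_load_avg only sums the enqueued ones */
		if (tasks[i].enqueued)
			runnable_load_avg += tasks[i].load_avg;
	}

	/* prints load_avg=896 runnable_load_avg=640 */
	printf("load_avg=%ld runnable_load_avg=%ld\n",
	       load_avg, runnable_load_avg);
	return 0;
}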

> And with the propagate load/util_avg patchset, we can even reflect
> task migration directly at the root domain, whereas before we had to
> wait for util/load_avg and runnable_load_avg to converge to the new
> value.
>
> Just to confirm one of my assumptions: the latency regression was
> already there in previous kernel versions and is not a result of the
> propagate load/util_avg patchset, right?

Yeah, the latency problem is independent of the propagation logic.
It's really that the meaning of the top-level runnable_load_avg is
different w/ and w/o cgroups.

> > AFAICS. That's the whole point of it, a sum closely tracking what's
> > currently on the cpu so that we can pick the cpu which has the most on
> > it now. It doesn't make sense to try to pick threads off of a cpu
> > which is generally loaded but doesn't have much going on right now,
> > after all.
>
> The only point of runnable_load_avg is to be null when a cfs_rq is
> idle (whereas load_avg is not), not to be higher than load_avg. The
> root cause is that load_balance only looks at "load" and not at the
> number of tasks currently running, and that's probably the main
> problem: runnable_load_avg was added because load_balance fails to
> filter out idle groups and idle rqs. We would be better off adding a
> new type in group_classify to tag groups that are idle, and doing the
> same in find_busiest_queue.

I'll follow up in the other subthread, but there really is a
fundamental difference in how we calculate runnable_load_avg w/ and
w/o cgroups. Independent of whether we can improve the load balancer
further, it is an existing bug.
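
Roughly what I mean, as a standalone toy model (not kernel code; the
numbers are made up and decay is ignored):

#include <stdio.h>

/* Toy model: a 512-weight task goes to sleep while another keeps running. */
int main(void)
{
	long task_load = 512;
	long other_load = 512;		/* a task that stays running */

	/* Without cgroups the sleeping task is dequeued from the root
	 * cfs_rq, so the root runnable_load_avg drops immediately. */
	long root_runnable_flat = other_load + task_load;
	root_runnable_flat -= task_load;			/* 512 */

	/* With the task in a cgroup, the group sched_entity stays enqueued
	 * on the root cfs_rq and (before this patch) its contribution
	 * tracked the group's load_avg, which still counts the blocked
	 * task, so the root runnable_load_avg stays inflated until the
	 * averages decay. */
	long group_load_avg = task_load;	/* blocked task still counted */
	long root_runnable_cgroup = other_load + group_load_avg;	/* 1024 */

	printf("flat=%ld cgroup=%ld\n",
	       root_runnable_flat, root_runnable_cgroup);
	return 0;
}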

> > * Can you please double check that the higher latencies w/ the patch
> > are reliably reproducible? The test machines that I use have
> > variable management load. They never dominate the machine but are
> > enough to disturb the results, so drawing out a reliable pattern
> > takes a lot of repeated runs. I'd really appreciate it if you
> > could double check that the pattern is reliable with different run
> > patterns (ie. instead of 10 consecutive runs after another,
> > interleaved).
>
> I always leave some time between 2 consecutive runs, and the 10
> consecutive runs take around 2min to execute.
>
> Then I have run these 10 consecutive tests several times, and the
> results stay the same.

Can you please try the patch at the end of this message? I'm
wondering whether you're seeing the errors accumulating from the wrong
min().

> > ones. If there's something I can easily buy, please point me to it.
> > If there's something I can borrow, that'd be great too.
>
> It looks like most web sites are currently out of stock.

:(

> > * If not, I'll try to clean up the debug patches I have and send them
> > your way to get more visibility, but given that these things tend to
> > be very iterative, it might take quite a few back-and-forths.
>
> Yes, that could be useful. Even a trace of the regression could be useful.

Yeah, will clean it up a bit and post. Will also post some traces.

And here's the updated patch. I didn't add the suggested
scale_down()'s. I'll follow up on that in the other subthread.
Thanks.
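
For reference, here's a rough standalone example (plain C, made-up
values, ignoring scale_load_down() and the fixed-point details) of
what the new min_t() capping in the patch below computes:

#include <stdio.h>

static long min_l(long a, long b) { return a < b ? a : b; }

int main(void)
{
	long shares = 1024;		/* freshly computed group shares */
	long grp_weight = 2048;		/* stand-in for gcfs_rq->load.weight */
	long grp_runnable = 1536;	/* stand-in for gcfs_rq->runnable_load_avg */

	/* scale the group's runnable load by shares/weight, then cap it
	 * at the shares value: 1536 * 1024 / 2048 = 768, under the cap */
	long load = min_l(shares, grp_runnable * shares / grp_weight);

	printf("propagated load = %ld\n", load);	/* 768 */
	return 0;
}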

kernel/sched/fair.c | 47 ++++++++++++++++++++---------------------------
1 file changed, 20 insertions(+), 27 deletions(-)

--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3078,37 +3078,30 @@ static inline void
 update_tg_cfs_load(struct cfs_rq *cfs_rq, struct sched_entity *se)
 {
 	struct cfs_rq *gcfs_rq = group_cfs_rq(se);
-	long delta, load = gcfs_rq->avg.load_avg;
+	long load = 0, delta;
 
 	/*
-	 * If the load of group cfs_rq is null, the load of the
-	 * sched_entity will also be null so we can skip the formula
+	 * A cfs_rq's load avg contribution to the parent should be scaled
+	 * to the sched_entity's weight. Use freshly calculated shares
+	 * instead of @se->load.weight as the latter may not reflect
+	 * changes from the current scheduling operation.
+	 *
+	 * Note that the propagation source is runnable_load_avg instead of
+	 * load_avg. This keeps every cfs_rq's runnable_load_avg true to
+	 * the sum of the scaled loads of all tasks queued under it, which
+	 * is important for the correct operation of the load balancer.
+	 *
+	 * This can make the sched_entity's load_avg jumpier but that
+	 * correctly reflects what would happen without cgroups if each
+	 * task's load is scaled across nesting - the load is being
+	 * averaged at the task and each cfs_rq.
 	 */
-	if (load) {
-		long tg_load;
+	if (gcfs_rq->load.weight) {
+		long shares = calc_cfs_shares(gcfs_rq, gcfs_rq->tg);
 
-		/* Get tg's load and ensure tg_load > 0 */
-		tg_load = atomic_long_read(&gcfs_rq->tg->load_avg) + 1;
-
-		/* Ensure tg_load >= load and updated with current load*/
-		tg_load -= gcfs_rq->tg_load_avg_contrib;
-		tg_load += load;
-
-		/*
-		 * We need to compute a correction term in the case that the
-		 * task group is consuming more CPU than a task of equal
-		 * weight. A task with a weight equals to tg->shares will have
-		 * a load less or equal to scale_load_down(tg->shares).
-		 * Similarly, the sched_entities that represent the task group
-		 * at parent level, can't have a load higher than
-		 * scale_load_down(tg->shares). And the Sum of sched_entities'
-		 * load must be <= scale_load_down(tg->shares).
-		 */
-		if (tg_load > scale_load_down(gcfs_rq->tg->shares)) {
-			/* scale gcfs_rq's load into tg's shares*/
-			load *= scale_load_down(gcfs_rq->tg->shares);
-			load /= tg_load;
-		}
+		load = min_t(long, scale_load_down(shares),
+			     gcfs_rq->runnable_load_avg *
+			     shares / gcfs_rq->load.weight);
 	}
 
 	delta = load - se->avg.load_avg;