Re: [PATCH 2/2] sched: Rewrite per entity runnable load average tracking

From: Peter Zijlstra
Date: Mon Jul 07 2014 - 06:46:59 EST


On Wed, Jul 02, 2014 at 10:30:56AM +0800, Yuyang Du wrote:
> The idea of per entity runnable load average (aggregated to cfs_rq and task_group load)
> was proposed by Paul Turner, and it is still followed by this rewrite. But this rewrite
> is made due to the following ends:
>
> (1). cfs_rq's load average (namely runnable_load_avg and blocked_load_avg) is updated
> incrementally by one entity at one time, which means the cfs_rq load average is only
> partially updated or asynchronous across its entities (the entity in question is up
> to date and contributes to the cfs_rq, but all other entities are effectively lagging
> behind).
>
> (2). cfs_rq load average is different between top rq->cfs_rq and task_group's per CPU
> cfs_rqs in whether or not blocked_load_average contributes to the load.

ISTR there was a reason for it; can't remember though, maybe pjt/ben can
remember.

> (3). How task_group's load is tracked is very confusing and complex.
>
> Therefore, this rewrite tackles these by:
>
> (1). Combine runnable and blocked load averages for cfs_rq. And track cfs_rq's load average
> as a whole (contributed by all runnable and blocked entities on this cfs_rq).
>
> (2). Only track task load average. Do not track task_group's per CPU entity average, but
> track that entity's own cfs_rq's aggregated average.
>
> This rewrite results in significantly less code and the expected consistency and clarity.
> Also, if you plot the previous cfs_rq runnable_load_avg and blocked_load_avg alongside the
> new rewritten load_avg and compare the lines, you can see the new load_avg is much
> more continuous (no abrupt jumping ups and downs) and decayed/updated more quickly and
> synchronously.

OK, I think I see what you're doing. I worry about a few things though:

> +static inline void synchronize_tg_load_avg(struct cfs_rq *cfs_rq, u32 old)
> {
> + s32 delta = cfs_rq->avg.load_avg - old;
>
> + if (delta)
> + atomic_long_add(delta, &cfs_rq->tg->load_avg);
> }

That tg->load_avg cacheline is already glowing red hot, and you've just
increased the number of updates to it. That's not going to be pleasant.
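
To make the concern concrete, here is a toy user-space model, not kernel
code. All names (tg_model, cfs_rq_model, sync_every_change, sync_batched)
and the 1/64 threshold are made up for illustration. It counts how often the
shared counter gets written when every non-zero delta is pushed, as in the
hunk above, versus when small deltas are held back locally.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>

/* Hypothetical model; none of these names are kernel API. */
struct tg_model {
	atomic_long load_avg;	/* shared by all CPUs: the contended line */
	long nr_updates;	/* writes to load_avg (single-threaded demo only) */
};

struct cfs_rq_model {
	struct tg_model *tg;
	long load_avg;		/* per-CPU value, cheap to update */
	long published;		/* what was last folded into tg->load_avg */
};

/* As in the quoted hunk: any non-zero change hits the shared counter. */
static void sync_every_change(struct cfs_rq_model *cfs_rq)
{
	long delta = cfs_rq->load_avg - cfs_rq->published;

	assert(cfs_rq->tg);
	if (delta) {
		atomic_fetch_add(&cfs_rq->tg->load_avg, delta);
		cfs_rq->tg->nr_updates++;
		cfs_rq->published = cfs_rq->load_avg;
	}
}

/*
 * Assumed alternative: only publish once the local value has drifted by
 * more than 1/64 of what was last published. The threshold is arbitrary.
 */
static void sync_batched(struct cfs_rq_model *cfs_rq)
{
	long delta = cfs_rq->load_avg - cfs_rq->published;

	assert(cfs_rq->tg);
	if (labs(delta) > cfs_rq->published / 64) {
		atomic_fetch_add(&cfs_rq->tg->load_avg, delta);
		cfs_rq->tg->nr_updates++;
		cfs_rq->published = cfs_rq->load_avg;
	}
}
```

The price of batching is that tg->load_avg can be stale by up to the
threshold per cfs_rq, so whether that is acceptable depends on how the
value is consumed.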


> +static inline void enqueue_entity_load_avg(struct sched_entity *se)
> {
> + struct sched_avg *sa = &se->avg;
> + struct cfs_rq *cfs_rq = cfs_rq_of(se);
> + u64 now = cfs_rq_clock_task(cfs_rq);
> + u32 old_load_avg = cfs_rq->avg.load_avg;
> + int migrated = 0;
>
> + if (entity_is_task(se)) {
> + if (sa->last_update_time == 0) {
> + sa->last_update_time = now;
> + migrated = 1;
> }
> + else
> + __update_load_avg(now, sa, se->on_rq * se->load.weight);
> }
>
> + __update_load_avg(now, &cfs_rq->avg, cfs_rq->load.weight);
>
> + if (migrated)
> + cfs_rq->avg.load_avg += sa->load_avg;
>
> + synchronize_tg_load_avg(cfs_rq, old_load_avg);
> }

So here you add the task to the cfs_rq avg when it's migrated in,
however:

> @@ -4552,17 +4326,9 @@ migrate_task_rq_fair(struct task_struct *p, int next_cpu)
> struct sched_entity *se = &p->se;
> struct cfs_rq *cfs_rq = cfs_rq_of(se);
>
> + /* Update task on old CPU, then ready to go (entity must be off the queue) */
> + __update_load_avg(cfs_rq_clock_task(cfs_rq), &se->avg, 0);
> + se->avg.last_update_time = 0;
>
> /* We have migrated, no longer consider this task hot */
> se->exec_start = 0;

there you don't remove it from the old cfs_rq first.
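
A toy model of what I mean, in user-space C with made-up names (rq_model,
task_model, migrate_out_as_posted, migrate_out_with_removal, enqueue_model).
It ignores decay and locking entirely, so it only shows the bookkeeping: if
the source side never subtracts the task's contribution, the same load is
counted on both CPUs right after the migration.

```c
#include <assert.h>

/* Hypothetical model; none of these names are kernel API. */
struct rq_model {
	long load_avg;		/* stands in for cfs_rq->avg.load_avg */
};

struct task_model {
	long load_avg;		/* stands in for se->avg.load_avg */
	int migrated;		/* stands in for last_update_time == 0 */
};

/* Mirrors the quoted migrate hunk: marks the task, leaves src untouched. */
static void migrate_out_as_posted(struct rq_model *src, struct task_model *p)
{
	(void)src;
	p->migrated = 1;
}

/* Assumed fix: take the task's contribution out of the source first. */
static void migrate_out_with_removal(struct rq_model *src, struct task_model *p)
{
	assert(src->load_avg >= p->load_avg);
	src->load_avg -= p->load_avg;
	p->migrated = 1;
}

/* Mirrors the quoted enqueue hunk: a migrated task is added to the target. */
static void enqueue_model(struct rq_model *dst, struct task_model *p)
{
	if (p->migrated) {
		dst->load_avg += p->load_avg;
		p->migrated = 0;
	}
}
```

With a task of load 512 moving from a source at 512 to an empty target, the
first variant leaves 512 on both sides (1024 in total), the second leaves
512 in total. In the real code the stale contribution on the source would
presumably decay away over time, but until then it is counted twice.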
