Re: [RFC PATCH] sched: reflect sched_entity movement into task_group's utilization

From: Vincent Guittot
Date: Wed May 11 2016 - 12:53:07 EST


Hi Dietmar

On 11 May 2016 at 16:45, Dietmar Eggemann <dietmar.eggemann@xxxxxxx> wrote:
> Hi Vincent,
>
> On 04/05/16 08:17, Vincent Guittot wrote:
>> Ensure that changes of the utilization of a sched_entity will be
>> reflected in the task_group hierarchy down to the root cfs.
>>
>> This patch tries another way than the flat utilization hierarchy proposal to
>> ensure that the changes will be propagated down to the root cfs.
>
> IMHO, the biggest advantage over the flat utilization hierarchy approach
> is that you don't have to sync the sched_avg::last_update_time values
> between se's and cfs_rq's.

I agree

>
>> The way to compute the sched average metrics stays the same so the utilization
>> only need to be synced with the local cfs rq timestamp.
>>
>> We keep an estimate of the utilization of each task group which is not used
>> for now but might be usefull in the future even if i don't have idea so far.
>
> A simple test case (a 50% task (run/period 8/16ms) switching between 2
> cpus every 160ms (due to restricted cpu affinity to one of these cpus)
> and running in tg_root/tg_1) shows the aimed behaviour at the root
> cfs_rq (immediate utilization drop from ~550 to 0 on the src cpu and
> immediate utilization rise from 0 to ~550 on the dst cpu in case of a
> migration from src to dst cpu.
>
> But in case I run the same task in tg_root/tg_1/tg_11 , then the
> utilization of the root cfs_rq looks like the one of a system w/o the
> patch. The problem seems to be that cfs_rq->diff_util_avg (owned by
> tg_root/tg_1 on cpuX) is not 0 (instead it has ~ -470) in case the task
> gets enqueued on cpuX. So the utilization signal of root cfs_rq ramps up
> slowly and doesn't drop in case the task migrates to the other cpu.

Strange, i have done my test with 2 tg levels and I can see the
increase of utilization of the dest cpu with the migration of the
task.
In the tests i have done, the increase on the dest CPU's utilization
happens directly and the decrease of the src CPU's utilization
occurs during the next update when removed_util_avg is applied. My
test was involving several tasks so the migration was happening at
task wake up whereas you changes the task affinity to force migration
in your test IIUC ?

I'm going to try to run your use case and reproduce your issue

>
> [...]
>
>> +/*
>> + * Save how much utilization has just been added/removed on cfs rq so we can
>> + * propagate it across the whole tg tree
>> + */
>> +static void set_tg_cfs_util(struct cfs_rq *cfs_rq, int delta)
>
> Maybe s/cfs_cfs_rq ?
>
>> +{
>> + if (!cfs_rq->tg)
>
> I guess here you want to bail if cfs_rq is the root cfs_rq. The current
> condition will always be true if CONFIG_FAIR_GROUP_SCHED is set. For the
> root cfs_rq cfs->tg is equal to &root_task_group.
>
> Since you don't need diff_util_avg on the root cfs_rq, you could use
> cfs_rq->tg == &root_task_group or &rq->cfs == cfs_rq or
> !cfs_rq->tg->se[cpu_of(rq_of(cfs_rq))]

good catch

>
> It doesn't harm functionality though, it's just the fact that you update
> cfs_rq->diff_util_avg for the root cfs_rq needlessly.
>
>> + return;
>> +
>> + cfs_rq->diff_util_avg += delta;
>> +}
>> +
>> +/*
>> + * Look at pending utilization change in the cfs rq and reflect it in the
>> + * sched_entity that represents the cfs rq at parent level
>
> Not sure if I understand the 'parent level' here correctly. For me, this
> se, the cfs_rq it 'owns' (se->my_q or group_cfs_rq(se)) and the tg
> (cfs_rq->tg) are at the same level.
>
> se->parent, se->cfs_rq (or cfs_rq_of(se)), cfs_rq->tg->parent would be
> the parent level.
>
> So for me update_tg_se_util() operates in one level and the update of
> the parent level would happen in the caller functon, e.g.
> attach_entity_load_avg().

This function checks if any direct changes happens in a
group_cfs_rq(se) and applied the same changes in its se. The latter
represents how group_cfs_rq(se) is seen in its parent cfs_rq_of(se).
I'm going to update the comments to make it clearer

>
>> + */
>> +static int update_tg_se_util(struct sched_entity *se)
>> +{
>> + struct cfs_rq *cfs_rq;
>> + long update_util_avg;
>> + long old_util_avg;
>> +
>> + if (entity_is_task(se))
>> + return 0;
>> +
>> + cfs_rq = se->my_q;
>
> Minor thing, Couldn't you use group_cfs_rq(se) here?

yes

>
>> +
>> + update_util_avg = cfs_rq->diff_util_avg;
>> +
>> + if (!update_util_avg)
>> + return 0;
>> +
>> + cfs_rq->diff_util_avg = 0;
>> +
>> + old_util_avg = se->avg.util_avg;
>> + se->avg.util_avg = max_t(long, se->avg.util_avg + update_util_avg, 0);
>> + se->avg.util_sum = se->avg.util_avg * LOAD_AVG_MAX;
>> +
>> + return se->avg.util_avg - old_util_avg;
>> +}
>> +
>> +
>> +/* Take into account the change of the utilization of a child task group */
>> +static void update_tg_cfs_util(struct sched_entity *se)
>> +{
>> + int delta;
>> + struct cfs_rq *cfs_rq;
>> +
>> + if (!se)
>> + return;
>
> This condition is only necessary for the call from
> update_blocked_averages() for a root cfs_rq, I guess?

yes. I will add a comment

>
> [...]
>
>> @@ -6260,6 +6348,8 @@ static void update_blocked_averages(int cpu)
>>
>> if (update_cfs_rq_load_avg(cfs_rq_clock_task(cfs_rq), cfs_rq, true))
>> update_tg_load_avg(cfs_rq, 0);
>> + /* Propagate pending util changes to the parent */
>> + update_tg_cfs_util(cfs_rq->tg->se[cpu_of(rq)]);
>
> Couldn't cpu_of(rq)] not just be cpu?

yes

Thanks for the comments

>
> [...]