Re: [PATCH v3 9/9] sched/topology: Define and use shortcut pointers for wakeup sd_flag scan

From: Vincent Guittot
Date: Thu Apr 16 2020 - 09:52:52 EST


On Thu, 16 Apr 2020 at 15:04, Dietmar Eggemann <dietmar.eggemann@xxxxxxx> wrote:
>
> On 16.04.20 12:24, Valentin Schneider wrote:
> >
> > On 16/04/20 08:46, Vincent Guittot wrote:
> >>> @@ -6657,7 +6646,19 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int wake_flags)
> >>>
> >>> rcu_read_lock();
> >>>
> >>> - sd = highest_flag_domain(cpu, sd_flag);
> >>> + switch (wake_flags & (WF_TTWU | WF_FORK | WF_EXEC)) {
> >>> + case WF_TTWU:
> >>> + sd_flag = SD_BALANCE_WAKE;
> >>> + sd = rcu_dereference(per_cpu(sd_balance_wake, cpu));
> >>
> >> It's worth having a direct pointer for the fast path which we always
> >> try to keep short but the other paths are already slow and will not
> >> get any benefit of this per cpu pointer.
> >> We should keep the loop for the slow paths
> >>
> >
> > Which fast/slow paths are you referring to here? want_affine vs
> > !want_affine? If so, do you then mean that we should do the switch case
> > only when !want_affine, and otherwise look for the domain via the
> > for_each_domain() loop?
>
> Coming back to the v2 discussion on this patch
>
> https://lore.kernel.org/r/20200311181601.18314-10-valentin.schneider@xxxxxxx
>
> SD_BALANCE_WAKE is not used in mainline anymore, so wakeups are always
> fast today.
>
> I.e. you wouldn't need a per_cpu(sd_balance_wake, cpu) since it's always
> NULL.
>
> I.e. want_affine logic and the 'for_each_domain(cpu, tmp)' isn't needed
> anymore.
>
> This will dramatically simplify the code in select_task_rq_fair().
>
> But I guess Vincent wants to keep the functionality so we're able to
> enable SD_BALANCE_WAKE on certain sd's?

I looked too quickly what was done by this patch. I thought that it
was adding a per_cpu pointer for all cases including the fast path
with wake affine but it only skips the for_each_domain loop for the
slow paths which don't need it because they are already slow.

It would be better to keep the for_each_domain loop for slow paths and
to use a per_cpu pointer for fast_path/wake affine. Regarding the
wake_affine path, we don't really care about looping all domains and
we could directly use the highest domain because wake_affine() that is
used in the loop, only uses the imbalance_pct of the sched domain for
wake_affine_weight() and it should not harm to use only the highest
domain and then select_idle_sibling doesn't use it but the llc or
asym_capacity pointer instead.