Re: [PATCH v3 06/10] sched/fair: Use the prefer_sibling flag of the current sched domain

From: Ricardo Neri
Date: Thu Feb 16 2023 - 00:11:19 EST


On Mon, Feb 13, 2023 at 10:43:28PM -0800, Ricardo Neri wrote:
> On Mon, Feb 13, 2023 at 01:17:09PM +0100, Dietmar Eggemann wrote:
> > On 10/02/2023 19:31, Ricardo Neri wrote:
> > > On Fri, Feb 10, 2023 at 05:12:30PM +0000, Valentin Schneider wrote:
> > >> On 10/02/23 17:53, Peter Zijlstra wrote:
> > >>> On Fri, Feb 10, 2023 at 02:54:56PM +0000, Valentin Schneider wrote:
> > >>>
> > >>>> So something like have SD_PREFER_SIBLING affect the SD it's on (and not
> > >>>> its parent), but remove it from the lowest non-degenerated topology level?
> > >>>
> > >>> So I was rather confused about the whole moving it between levels things
> > >>> this morning -- conceptually, prefer siblings says you want to try
> > >>> sibling domains before filling up your current domain. Now, balancing
> > >>> between siblings happens one level up, hence looking at child->flags
> > >>> makes perfect sense.
> > >>>
> > >>> But looking at the current domain and still calling it prefer sibling
> > >>> makes absolutely no sense what so ever.
> > >>>
> > >>
> > >> True :-)
> > >>
> > >>> In that confusion I think I also got the polarity wrong, I thought you
> > >>> wanted to kill prefer_sibling for the assymetric SMT cases, instead you
> > >>> want to force enable it as long as there is one SMT child around.
> > >
> > > Exactly.
> > >
> > >>>
> > >>> Whichever way around it we do it, I'm thinking perhaps some renaming
> > >>> might be in order to clarify things.
> > >>>
> > >>> How about adding a flag SD_SPREAD_TASKS, which is the effective toggle
> > >>> of the behaviour, but have it be set by children with SD_PREFER_SIBLING
> > >>> or something.
> > >>>
> > >>
> > >> Or entirely bin SD_PREFER_SIBLING and stick with SD_SPREAD_TASKS, but yeah
> > >> something along those lines.
> > >
> > > I sense a consesus towards SD_SPREAD_TASKS.
> >
> > Can you not detect the E-core dst_cpu case on MC with:
> >
> > + if (child)
> > + sds->prefer_sibling = child->flags & SD_PREFER_SIBLING;
> > + else if (sds->busiest)
> > + sds->prefer_sibling = sds->busiest->group_weight > 1;
>
> Whose child wants the prefer_sibling setting? In update_sd_lb_stats(), it
> is set based on the flags of the destination CPU's sched domain. But when
> used in find_busiest_group() tasks are spread from the busiest group's
> child domain.
>
> Your proposed code, also needs a check for SD_PREFER_SIBLING, no?

I tweaked the solution that Dietmar proposed:

- sds->prefer_sibling = child && child->flags & SD_PREFER_SIBLING;
+ if (sds->busiest)
+ sds->prefer_sibling = sds->busiest->flags & SD_PREFER_SIBLING;

This comes from the observation that the prefer_sibling setting acts on
busiest group. It then depends on whether the busiest group, not the local
group, has child sched sched domains. Today it works because in most cases
both the local and the busiest groups have child domains with the SD_
PREFER_SIBLING flag.

This would also satisfy sched domains with the SD_ASYM_CPUCAPACITY flag as
prefer_sibling would not be set in that case.

It would also conserve the current behavior at the NUMA level. We would
not need to implement SD_SPREAD_TASKS.

This would both fix the SMT vs non-SMT bug and be less invasive.

Thoughts?

Thanks and BR,
Ricardo