Re: [PATCH v3 06/10] sched/fair: Use the prefer_sibling flag of the current sched domain

From: Valentin Schneider
Date: Fri Feb 10 2023 - 12:14:29 EST


On 10/02/23 17:53, Peter Zijlstra wrote:
> On Fri, Feb 10, 2023 at 02:54:56PM +0000, Valentin Schneider wrote:
>
>> So something like have SD_PREFER_SIBLING affect the SD it's on (and not
>> its parent), but remove it from the lowest non-degenerated topology level?
>
> So I was rather confused about the whole moving it between levels thing
> this morning -- conceptually, prefer siblings says you want to try
> sibling domains before filling up your current domain. Now, balancing
> between siblings happens one level up, hence looking at child->flags
> makes perfect sense.
>
> But looking at the current domain and still calling it prefer sibling
> makes absolutely no sense whatsoever.
>

True :-)

> In that confusion I think I also got the polarity wrong, I thought you
> wanted to kill prefer_sibling for the asymmetric SMT cases, instead you
> want to force enable it as long as there is one SMT child around.
>
> Whichever way around it we do it, I'm thinking perhaps some renaming
> might be in order to clarify things.
>
> How about adding a flag SD_SPREAD_TASKS, which is the effective toggle
> of the behaviour, but have it be set by children with SD_PREFER_SIBLING
> or something.
>

Or entirely bin SD_PREFER_SIBLING and stick with SD_SPREAD_TASKS, but yeah
something along those lines.
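
To make the idea concrete, here is a rough, untested sketch of the "set by
children" variant on a toy model. Everything in it is an assumption for
illustration: SD_SPREAD_TASKS does not exist in the kernel, the flag values
are placeholders, and struct toy_domain / toy_propagate_spread() /
toy_should_spread() are hypothetical stand-ins, not struct sched_domain or
any topology code.

```c
#include <stdbool.h>
#include <stddef.h>

/* Placeholder values; the real SD_* flags are defined by the kernel. */
#define SD_PREFER_SIBLING 0x1
#define SD_SPREAD_TASKS   0x2	/* hypothetical flag proposed in this thread */

/* Toy stand-in for a sched domain: only the links and the flags. */
struct toy_domain {
	struct toy_domain *parent;
	struct toy_domain *child;
	int flags;
};

/*
 * Walk from the lowest level upwards. A domain gets SD_SPREAD_TASKS
 * when its child carries SD_PREFER_SIBLING, i.e. the toggle lands on
 * the level where balancing between the siblings actually happens.
 */
static void toy_propagate_spread(struct toy_domain *lowest)
{
	for (struct toy_domain *sd = lowest; sd; sd = sd->parent) {
		if (sd->child && (sd->child->flags & SD_PREFER_SIBLING))
			sd->flags |= SD_SPREAD_TASKS;
	}
}

/* The balancer would then only look at the current domain's own flag. */
static bool toy_should_spread(const struct toy_domain *sd)
{
	return (sd->flags & SD_SPREAD_TASKS) != 0;
}
```

With an SMT level flagged SD_PREFER_SIBLING under an MC level, only the MC
level ends up with SD_SPREAD_TASKS in this model; the lowest level never
does, since it has no child.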

> OTOH, there's also
>
> if (busiest->group_weight == 1 || sds->prefer_sibling) {
>
> which explicitly also takes the group-of-one (the !child case) into
> account, but that's not consistently done.
>
> sds->prefer_sibling = !child || child->flags & SD_PREFER_SIBLING;
>
> seems an interesting option, except perhaps ASYM_CPUCAPACITY -- I
> forget, can CPUs of different capacity be in the same leaf domain? With
> big.LITTLE I think not, they had their own cache domains and so you get
> at least MC domains per capacity, but DynamiQ might have totally wrecked
> that party.

Yeah, newer systems can have different capacities in one MC domain, cf:

b7a331615d25 ("sched/fair: Add asymmetric CPU capacity wakeup scan")
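
For reference, a minimal sketch of what the "!child ||" expression quoted
above evaluates to, on a toy model. The struct and helper names are
hypothetical and the flag value is a placeholder; this is not the actual
update_sd_lb_stats() code.

```c
#include <stdbool.h>
#include <stddef.h>

#define SD_PREFER_SIBLING 0x1	/* placeholder value, for illustration only */

/* Toy stand-in for a sched domain: a child link and the flags. */
struct toy_sd {
	struct toy_sd *child;
	int flags;
};

/*
 * Mirrors: prefer_sibling = !child || child->flags & SD_PREFER_SIBLING;
 * A leaf domain (no child) balances between groups of one CPU, so it
 * is treated as preferring siblings unconditionally.
 */
static bool toy_prefer_sibling(const struct toy_sd *sd)
{
	const struct toy_sd *child = sd->child;

	return !child || (child->flags & SD_PREFER_SIBLING) != 0;
}
```

In this model a leaf MC domain spanning CPUs of different capacity would
always report prefer_sibling, which is where the ASYM_CPUCAPACITY concern
comes in.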

>
>> (+ add it to the first NUMA level to keep things as they are, even if TBF I
>> find relying on it for NUMA balancing a bit odd).
>
> Arguably it ought to perhaps be one of those node_reclaim_distance
> things. The thing is that NUMA-1 is often fairly quick, esp. these days
> where it's basically on-die NUMA.

Right, makes sense, thanks.