RE: [PATCH 1/1] sched/topology: Make sched_init_numa() use a set for the deduplicating sort

From: Song Bao Hua (Barry Song)
Date: Thu Jan 28 2021 - 21:04:01 EST




> -----Original Message-----
> From: Valentin Schneider [mailto:valentin.schneider@xxxxxxx]
> Sent: Friday, January 29, 2021 3:47 AM
> To: Song Bao Hua (Barry Song) <song.bao.hua@xxxxxxxxxxxxx>;
> linux-kernel@xxxxxxxxxxxxxxx
> Cc: mingo@xxxxxxxxxx; peterz@xxxxxxxxxxxxx; vincent.guittot@xxxxxxxxxx;
> dietmar.eggemann@xxxxxxx; morten.rasmussen@xxxxxxx; mgorman@xxxxxxx
> Subject: RE: [PATCH 1/1] sched/topology: Make sched_init_numa() use a set
> for the deduplicating sort
>
> On 25/01/21 21:35, Song Bao Hua (Barry Song) wrote:
> > I was using 5.11-rc1. One thing I'd like to mention is that:
> >
> > For the below topology:
> > +-------+          +-----+
> > | node1 |    20    |node2|
> > |       +----------+     |
> > +---+---+          +-----+
> >     |                  |12
> >  12 |                  |
> > +---+---+          +---+-+
> > |       |          |node3|
> > | node0 |          |     |
> > +-------+          +-----+
> >
> > with node0-node2 as 22, node0-node3 as 24, node1-node3 as 22.
> >
> > I will get the below sched_domains_numa_distance[]:
> > 10, 12, 22, 24
> > As you can see there is *no* 20. So node1 and node2 will
> > only get a two-level numa sched_domain:
> >
>
>
> So that's
>
> -numa node,cpus=0-1,nodeid=0 -numa node,cpus=2-3,nodeid=1, \
> -numa node,cpus=4-5,nodeid=2, -numa node,cpus=6-7,nodeid=3, \
> -numa dist,src=0,dst=1,val=12, \
> -numa dist,src=0,dst=2,val=22, \
> -numa dist,src=0,dst=3,val=24, \
> -numa dist,src=1,dst=2,val=20, \
> -numa dist,src=1,dst=3,val=22, \
> -numa dist,src=2,dst=3,val=12
>
> but running this still doesn't get me a splat. Debugging
> sched_domains_numa_distance[] still gives me
> {10, 12, 20, 22, 24}
>
> >
> > But for the below topology:
> > +-------+          +-----+
> > | node0 |    20    |node2|
> > |       +----------+     |
> > +---+---+          +-----+
> >     |                  |12
> >  12 |                  |
> > +---+---+          +---+-+
> > |       |          |node3|
> > | node1 |          |     |
> > +-------+          +-----+
> >
> > with node1-node2 as 22, node1-node3 as 24, node0-node3 as 22.
> >
> > I will get the below sched_domains_numa_distance[]:
> > 10, 12, 20, 22, 24
> >
> > What I have seen is that the performance will be better if we
> > drop the 20, as we will get a sched_domain hierarchy with fewer
> > levels, and the two intermediate nodes won't have the group span
> > issue.
> >
>
> That is another thing that's worth considering. Morten was arguing that if
> the distance between two nodes is so tiny, it might not be worth
> representing it at all in the scheduler topology.

Yes. I agree it is a different thing. Anyway, I saw your patch has been
merged into the sched tree. One side effect of your patch is that one more
sched_domain level is introduced for this topology:

   +-------------------- 24 --------------------+
   |                                            |
   |              +---------- 22 -----------+   |
   |              |                         |   |
+--+----+     +---+---+     +-------+     +-+---+-+
|   0   |  12 |   1   |  20 |   2   |  12 |   3   |
|       +-----+       +-----+       +-----+       |
+---+---+     +-------+     +---+---+     +-------+
    |                           |
    +----------- 22 ------------+
Without the patch, Linux will use 10, 12, 22, 24 to build sched_domains;
with your patch, Linux will use 10, 12, 20, 22, 24 to build sched_domains.
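
(To make that concrete, below is a minimal userspace sketch of the
deduplicating-set idea -- a boolean array indexed by distance value --
applied to the topology above; the matrix and names are only
illustrative, not the kernel code:)

#include <stdio.h>
#include <stdbool.h>

#define NR_NODES 4
#define MAX_DIST 256	/* node_distance() values fit in one byte */

/* Illustrative distance matrix for the 4-node topology above. */
static const int dist[NR_NODES][NR_NODES] = {
	{ 10, 12, 22, 24 },
	{ 12, 10, 20, 22 },
	{ 22, 20, 10, 12 },
	{ 24, 22, 12, 10 },
};

int main(void)
{
	bool seen[MAX_DIST] = { false };
	int i, j, d;

	/* Mark every distance that occurs in the matrix. */
	for (i = 0; i < NR_NODES; i++)
		for (j = 0; j < NR_NODES; j++)
			seen[dist[i][j]] = true;

	/* Walking the set in index order yields the sorted, unique list. */
	printf("distances:");
	for (d = 0; d < MAX_DIST; d++)
		if (seen[d])
			printf(" %d", d);
	printf("\n");	/* -> 10 12 20 22 24 */

	return 0;
}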

So one more layer is added. What I have seen is that:

For node0, the sched_domain <=12 and the sched_domain <=20 span the same
range (node0, node1), so one of them is redundant. Then in
cpu_attach_domain(), the redundant one is dropped because it falls under
"remove the sched domains which do not contribute to scheduling".
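
Roughly, as a toy illustration (the struct and values below are made up
for the example, not taken from the kernel):

#include <stdio.h>

#define NR_LEVELS 4

/* Toy sched_domain level: just the set of nodes it spans (bit n == node n). */
struct level {
	const char *name;
	unsigned int span;
};

int main(void)
{
	/* Levels as seen from node0 in the topology above. */
	struct level levels[NR_LEVELS] = {
		{ "<=12", 0x3 },	/* node0, node1          */
		{ "<=20", 0x3 },	/* node0, node1 (again!) */
		{ "<=22", 0x7 },	/* node0..node2          */
		{ "<=24", 0xf },	/* node0..node3          */
	};
	int i;

	/*
	 * A level that spans exactly what the level below it already
	 * spans adds nothing, so it gets dropped when domains are attached.
	 */
	for (i = 1; i < NR_LEVELS; i++)
		if (levels[i].span == levels[i - 1].span)
			printf("level %s is redundant, dropped\n",
			       levels[i].name);

	return 0;
}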

For node1 and node2, the original code had no "20", and thus built one
fewer sched_domain level.

What is really interesting is that removing 20 actually gives better
benchmark results in speccpu :-)


>
> > Thanks
> > Barry

Thanks
Barry