RE: [PATCH 1/1] sched/topology: Make sched_init_numa() use a set for the deduplicating sort

From: Song Bao Hua (Barry Song)
Date: Mon Jan 25 2021 - 17:32:40 EST




> -----Original Message-----
> From: Valentin Schneider [mailto:valentin.schneider@xxxxxxx]
> Sent: Tuesday, January 26, 2021 5:46 AM
> To: Song Bao Hua (Barry Song) <song.bao.hua@xxxxxxxxxxxxx>;
> linux-kernel@xxxxxxxxxxxxxxx
> Cc: mingo@xxxxxxxxxx; peterz@xxxxxxxxxxxxx; vincent.guittot@xxxxxxxxxx;
> dietmar.eggemann@xxxxxxx; morten.rasmussen@xxxxxxx; mgorman@xxxxxxx
> Subject: RE: [PATCH 1/1] sched/topology: Make sched_init_numa() use a set for
> the deduplicating sort
>
> On 25/01/21 09:26, Valentin Schneider wrote:
> > On 25/01/21 02:23, Song Bao Hua (Barry Song) wrote:
> >
> >> with the below topology:
> >> qemu-system-aarch64 -M virt -nographic \
> >> -smp cpus=8 \
> >> -numa node,cpus=0-1,nodeid=0 \
> >> -numa node,cpus=2-3,nodeid=1 \
> >> -numa node,cpus=4-5,nodeid=2 \
> >> -numa node,cpus=6-7,nodeid=3 \
> >> -numa dist,src=0,dst=1,val=12 \
> >> -numa dist,src=0,dst=2,val=20 \
> >> -numa dist,src=0,dst=3,val=22 \
> >> -numa dist,src=1,dst=2,val=22 \
> >> -numa dist,src=2,dst=3,val=12 \
> >> -numa dist,src=1,dst=3,val=24 \
> >>
> >>
> >> The panic address is *1294:
> >>
> >> if (sdd->sd) {
> >> 1280: f9400e61 ldr x1, [x19, #24]
> >> 1284: b4000201 cbz x1, 12c4 <build_sched_domains+0x414>
> >> sd = *per_cpu_ptr(sdd->sd, j);
> >> 1288: 93407eb7 sxtw x23, w21
> >> 128c: aa0103e0 mov x0, x1
> >> 1290: f8777ac2 ldr x2, [x22, x23, lsl #3]
> >> *1294: f8626800 ldr x0, [x0, x2]
> >> if (sd && (sd->flags & SD_OVERLAP))
> >> 1298: b4000120 cbz x0, 12bc <build_sched_domains+0x40c>
> >> 129c: b9403803 ldr w3, [x0, #56]
> >> 12a0: 365800e3 tbz w3, #11, 12bc
> >> <build_sched_domains+0x40c>
> >> free_sched_groups(sd->groups, 0);
> >> 12a4: f9400800 ldr x0, [x0, #16]
> >> if (!sg)
> >>
> >
> > Thanks for giving it a shot, let me run that with your topology and see
> > where I end up.
> >
>
> I can't seem to reproduce this - your topology is actually among the ones
> I tested this against.
>
> Applying this patch obviously doesn't get rid of the group span issue, but
> it does remove this warning:
>
> [ 0.354806] ERROR: Node-0 not representative
> [ 0.354806]
> [ 0.355223] 10 12 20 22
> [ 0.355475] 12 10 22 24
> [ 0.355648] 20 22 10 12
> [ 0.355814] 22 24 12 10
>
> I'm running this based on tip/sched/core:
>
> 65bcf072e20e ("sched: Use task_current() instead of 'rq->curr == p'")
I was using 5.11-rc1. One thing I'd like to mention is the following:

For the below topology:
+-------+ +-----+
| node1 | 20 |node2|
| +----------+ |
+---+---+ +-----+
| |12
12 | |
+---+---+ +---+-+
| | |node3|
| node0 | | |
+-------+ +-----+

with node0-node2 as 22, node0-node3 as 24, and node1-node3 as 22.

I will get the following sched_domains_numa_distance[]:
10, 12, 22, 24
As you can see, there is *no* 20 in it, so node1 and node2 will
only get a two-level NUMA sched_domain.


But for the below topology:
+-------+ +-----+
| node0 | 20 |node2|
| +----------+ |
+---+---+ +-----+
| |12
12 | |
+---+---+ +---+-+
| | |node3|
| node1 | | |
+-------+ +-----+

with node1-node2 as 22, node1-node3 as 24, and node0-node3 as 22.

I will get the following sched_domains_numa_distance[]:
10, 12, 20, 22, 24

What I have seen is that performance is better if we drop the 20,
as we get a sched_domain hierarchy with fewer levels, and the two
intermediate nodes won't have the group span issue.
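For reference, the difference between deduplicating only node 0's row of
the distance table and deduplicating the whole table can be sketched as
below. This is a standalone illustration, not the kernel code; the two
matrices are my encoding of the two topologies above, with entry [i][j]
standing in for node_distance(i, j).

```python
# First topology above: node0 hangs off node1, node1-node2 = 20.
TOPO_A = [
    [10, 12, 22, 24],
    [12, 10, 20, 22],
    [22, 20, 10, 12],
    [24, 22, 12, 10],
]

# Second topology above: node1 hangs off node0, node0-node2 = 20.
TOPO_B = [
    [10, 12, 20, 22],
    [12, 10, 22, 24],
    [20, 22, 10, 12],
    [22, 24, 12, 10],
]

def dedup_row0(dist):
    """Row-0-only dedup: assumes node 0's row holds every unique distance."""
    return sorted(set(dist[0]))

def dedup_full(dist):
    """Whole-table dedup: collect unique distances over all node pairs."""
    return sorted({d for row in dist for d in row})

print(dedup_row0(TOPO_A))  # -> [10, 12, 22, 24]      (the 20 is missed)
print(dedup_full(TOPO_A))  # -> [10, 12, 20, 22, 24]
print(dedup_row0(TOPO_B))  # -> [10, 12, 20, 22]      (the 24 is missed)
print(dedup_full(TOPO_B))  # -> [10, 12, 20, 22, 24]
```

So with an asymmetric topology, which distances survive the dedup (and
therefore how many NUMA levels get built) depends on which strategy is
used and on which node happens to be node 0.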

Thanks
Barry