RE: 5.11-rc4+git: Shortest NUMA path spans too many nodes

From: Song Bao Hua (Barry Song)
Date: Thu Jan 21 2021 - 16:21:58 EST




> -----Original Message-----
> From: Dietmar Eggemann [mailto:dietmar.eggemann@xxxxxxx]
> Sent: Friday, January 22, 2021 7:54 AM
> To: Valentin Schneider <valentin.schneider@xxxxxxx>; Meelis Roos
> <mroos@xxxxxxxx>; LKML <linux-kernel@xxxxxxxxxxxxxxx>
> Cc: Peter Zijlstra <peterz@xxxxxxxxxxxxx>; Vincent Guittot
> <vincent.guittot@xxxxxxxxxx>; Song Bao Hua (Barry Song)
> <song.bao.hua@xxxxxxxxxxxxx>; Mel Gorman <mgorman@xxxxxxx>
> Subject: Re: 5.11-rc4+git: Shortest NUMA path spans too many nodes
>
> On 21/01/2021 19:21, Valentin Schneider wrote:
> > On 21/01/21 19:39, Meelis Roos wrote:
> >>> Could you paste the output of the below?
> >>>
> >>> $ cat /sys/devices/system/node/node*/distance
> >>
> >> 10 12 12 14 14 14 14 16
> >> 12 10 14 12 14 14 12 14
> >> 12 14 10 14 12 12 14 14
> >> 14 12 14 10 12 12 14 14
> >> 14 14 12 12 10 14 12 14
> >> 14 14 12 12 14 10 14 12
> >> 14 12 14 14 12 14 10 12
> >> 16 14 14 14 14 12 12 10
> >>
> >
> > Thanks!
> >
> >>
> >>> Additionally, booting your system with CONFIG_SCHED_DEBUG=y and
> >>> appending 'sched_debug' to your cmdline should yield some extra data.
> >>
> >> [ 0.000000] Linux version 5.11.0-rc4-00015-g45dfb8a5659a (mroos@x4600m2)
> (gcc (Debian 10.2.1-6) 10.2.1 20210110, GNU ld (GNU Binutils for Debian) 2.35.1)
> #55 SMP Thu Jan 21 19:23:10 EET 2021
> >> [ 0.000000] Command line:
> BOOT_IMAGE=/boot/vmlinuz-5.11.0-rc4-00015-g45dfb8a5659a root=/dev/sda1 ro
> quiet
> >
> > This is missing 'sched_debug' to get the extra topology debug prints (yes
> > it needs an extra cmdline argument on top of having CONFIG_SCHED_DEBUG=y),
> > but I should be able to generate those locally by feeding QEMU the above
> > distance table.
>
> Can be recreated with (simplified with only 1 CPU per node):
>
> $ qemu-system-aarch64 -kernel /opt/git/kernel_org/arch/arm64/boot/Image -hda
> /opt/git/tools/qemu-imgs-manipulator/images/qemu-image-aarch64.img -append
> 'root=/dev/vda console=ttyAMA0 loglevel=8 sched_debug' -nographic -machine
> virt,gic-version=max -smp cores=8 -m 512 -cpu cortex-a57 -numa
> node,cpus=0,nodeid=0 -numa node,cpus=1,nodeid=1, -numa node,cpus=2,nodeid=2,
> -numa node,cpus=3,nodeid=3, -numa node,cpus=4,nodeid=4, -numa
> node,cpus=5,nodeid=5, -numa node,cpus=6,nodeid=6, -numa node,cpus=7,nodeid=7,
> -numa dist,src=0,dst=1,val=12, -numa dist,src=0,dst=2,val=12, -numa
> dist,src=0,dst=3,val=14, -numa dist,src=0,dst=4,val=14, -numa
> dist,src=0,dst=5,val=14, -numa dist,src=0,dst=6,val=14, -numa
> dist,src=0,dst=7,val=16, -numa dist,src=1,dst=2,val=14, -numa
> dist,src=1,dst=3,val=12, -numa dist,src=1,dst=4,val=14, -numa
> dist,src=1,dst=5,val=14, -numa dist,src=1,dst=6,val=12, -numa
> dist,src=1,dst=7,val=14, -numa dist,src=2,dst=3,val=14, -numa
> dist,src=2,dst=4,val=12, -numa dist,src=2,dst=5,val=12, -numa
> dist,src=2,dst=6,val=14, -numa dist,src=2,dst=7,val=14, -numa
> dist,src=3,dst=4,val=12, -numa dist,src=3,dst=5,val=12, -numa
> dist,src=3,dst=6,val=14, -numa dist,src=3,dst=7,val=14, -numa
> dist,src=4,dst=5,val=14, -numa dist,src=4,dst=6,val=12, -numa
> dist,src=4,dst=7,val=14, -numa dist,src=5,dst=6,val=14, -numa
> dist,src=5,dst=7,val=12, -numa dist,src=6,dst=7,val=12
>
> [ 0.206628] ------------[ cut here ]------------
> [ 0.206698] Shortest NUMA path spans too many nodes
> [ 0.207119] WARNING: CPU: 0 PID: 1 at kernel/sched/topology.c:753
> cpu_attach_domain+0x42c/0x87c
> [ 0.207176] Modules linked in:
> [ 0.207373] CPU: 0 PID: 1 Comm: swapper/0 Not tainted
> 5.11.0-rc2-00010-g65bcf072e20e-dirty #81
> [ 0.207458] Hardware name: linux,dummy-virt (DT)
> [ 0.207584] pstate: 60000005 (nZCv daif -PAN -UAO -TCO BTYPE=--)
> [ 0.207618] pc : cpu_attach_domain+0x42c/0x87c
> [ 0.207646] lr : cpu_attach_domain+0x42c/0x87c
> [ 0.207665] sp : ffff800011fcbbf0
> [ 0.207679] x29: ffff800011fcbbf0 x28: ffff0000024d8200
> [ 0.207735] x27: 0000000000001fef x26: 0000000000001917
> [ 0.207755] x25: ffff0000024d8000 x24: 0000000000001917
> [ 0.207772] x23: 0000000000000000 x22: ffff800011b69a40
> [ 0.207789] x21: ffff0000024d8320 x20: ffff8000116fda80
> [ 0.207806] x19: ffff0000024d8000 x18: 0000000000000000
> [ 0.207822] x17: 0000000000000000 x16: 00000000bd30d762
> [ 0.207838] x15: 0000000000000030 x14: ffffffffffffffff
> [ 0.207855] x13: ffff800011b82e08 x12: 00000000000001b9
> [ 0.207871] x11: 0000000000000093 x10: ffff800011bdae08
> [ 0.207887] x9 : 00000000fffff000 x8 : ffff800011b82e08
> [ 0.207922] x7 : ffff800011bdae08 x6 : 0000000000000000
> [ 0.207939] x5 : 0000000000000000 x4 : 0000000000000000
> [ 0.207955] x3 : 00000000ffffffff x2 : 0000000000000000
> [ 0.207972] x1 : 0000000000000000 x0 : ffff000018020000
> [ 0.208125] Call trace:
> [ 0.208230] cpu_attach_domain+0x42c/0x87c
> [ 0.208256] build_sched_domains+0x1238/0x12f4
> [ 0.208271] sched_init_domains+0x80/0xb0
> [ 0.208283] sched_init_smp+0x30/0x80
> [ 0.208299] kernel_init_freeable+0xf4/0x238
> [ 0.208313] kernel_init+0x14/0x118
> [ 0.208328] ret_from_fork+0x10/0x34
> [ 0.208507] ---[ end trace 75cafa7c7d1a3d7e ]---
> [ 0.208706] CPU0 attaching sched-domain(s):
> [ 0.208756] domain-0: span=0-2 level=NUMA
> [ 0.209001] groups: 0:{ span=0 cap=1017 }, 1:{ span=1 cap=1016 }, 2:{ span=2
> cap=1015 }
> [ 0.209247] domain-1: span=0-6 level=NUMA
> [ 0.209280] groups: 0:{ span=0-2 mask=0 cap=3048 }, 3:{ span=1,3-5 mask=3
> cap=4073 }, 6:{ span=1,4,6-7 mask=6 cap=4084 }
> [ 0.209693] ERROR: groups don't span domain->span
> [ 0.209703] domain-2: span=0-7 level=NUMA
> [ 0.209722] groups: 0:{ span=0-6 mask=0 cap=7114 }, 7:{ span=1-7 mask=7
> cap=7163 }
> [ 0.210361] CPU1 attaching sched-domain(s):
> [ 0.210376] domain-0: span=0-1,3,6 level=NUMA
> [ 0.210411] groups: 1:{ span=1 cap=1016 }, 3:{ span=3 cap=1018 }, 6:{ span=6
> cap=1017 }, 0:{ span=0 cap=1017 }
> [ 0.210493] domain-1: span=0-7 level=NUMA
> [ 0.210511] groups: 1:{ span=0-1,3,6 mask=1 cap=4075 }, 2:{ span=0,2,4-5
> mask=2 cap=4070 }, 7:{ span=5-7 mask=7 cap=3067 }
> [ 0.210641] CPU2 attaching sched-domain(s):
> [ 0.210653] domain-0: span=0,2,4-5 level=NUMA
> [ 0.210672] groups: 2:{ span=2 cap=1015 }, 4:{ span=4 cap=1016 }, 5:{ span=5
> cap=1015 }, 0:{ span=0 cap=1017 }
> [ 0.210752] domain-1: span=0-7 level=NUMA
> [ 0.210769] groups: 2:{ span=0,2,4-5 mask=2 cap=4070 }, 3:{ span=1,3-5
> mask=3 cap=4073 }, 6:{ span=1,4,6-7 mask=6 cap=4084 }
> [ 0.210860] CPU3 attaching sched-domain(s):
> [ 0.210870] domain-0: span=1,3-5 level=NUMA
> [ 0.210887] groups: 3:{ span=3 cap=1018 }, 4:{ span=4 cap=1016 }, 5:{ span=5
> cap=1015 }, 1:{ span=1 cap=1016 }
> [ 0.210965] domain-1: span=0-7 level=NUMA
> [ 0.210981] groups: 3:{ span=1,3-5 mask=3 cap=4073 }, 6:{ span=1,4,6-7
> mask=6 cap=4084 }, 0:{ span=0-2 mask=0 cap=3048 }
> [ 0.211109] CPU4 attaching sched-domain(s):
> [ 0.211134] domain-0: span=2-4,6 level=NUMA
> [ 0.211151] groups: 4:{ span=4 cap=1016 }, 6:{ span=6 cap=1017 }, 2:{ span=2
> cap=1015 }, 3:{ span=3 cap=1018 }
> [ 0.211229] domain-1: span=0-7 level=NUMA
> [ 0.211245] groups: 4:{ span=2-4,6 mask=4 cap=4081 }, 5:{ span=2-3,5,7
> mask=5 cap=4082 }, 0:{ span=0-2 mask=0 cap=3048 }
> [ 0.211383] CPU5 attaching sched-domain(s):
> [ 0.211393] domain-0: span=2-3,5,7 level=NUMA
> [ 0.211425] groups: 5:{ span=5 cap=1015 }, 7:{ span=7 cap=1019 }, 2:{ span=2
> cap=1015 }, 3:{ span=3 cap=1018 }
> [ 0.211506] domain-1: span=0-7 level=NUMA
> [ 0.211524] groups: 5:{ span=2-3,5,7 mask=5 cap=4082 }, 6:{ span=1,4,6-7
> mask=6 cap=4084 }, 0:{ span=0-2 mask=0 cap=3048 }
> [ 0.211618] CPU6 attaching sched-domain(s):
> [ 0.211628] domain-0: span=1,4,6-7 level=NUMA
> [ 0.211645] groups: 6:{ span=6 cap=1017 }, 7:{ span=7 cap=1019 }, 1:{ span=1
> cap=1016 }, 4:{ span=4 cap=1016 }
> [ 0.211728] domain-1: span=0-7 level=NUMA
> [ 0.211745] groups: 6:{ span=1,4,6-7 mask=6 cap=4084 }, 0:{ span=0-2 mask=0
> cap=3048 }, 3:{ span=1,3-5 mask=3 cap=4073 }
> [ 0.211855] CPU7 attaching sched-domain(s):
> [ 0.211866] domain-0: span=5-7 level=NUMA
> [ 0.211884] groups: 7:{ span=7 cap=1019 }, 5:{ span=5 cap=1015 }, 6:{ span=6
> cap=1017 }
> [ 0.211949] domain-1: span=1-7 level=NUMA
> [ 0.211966] groups: 7:{ span=5-7 mask=7 cap=3067 }, 1:{ span=0-1,3,6 mask=1
> cap=4075 }, 2:{ span=0,2,4-5 mask=2 cap=4070 }
> [ 0.212047] ERROR: groups don't span domain->span
> [ 0.212055] domain-2: span=0-7 level=NUMA
> [ 0.212072] groups: 7:{ span=1-7 mask=7 cap=7163 }, 0:{ span=0-6 mask=0
> cap=7114 }
>
> # cat /sys/devices/system/node/node*/distance
> 10 12 12 14 14 14 14 16
> 12 10 14 12 14 14 12 14
> 12 14 10 14 12 12 14 14
> 14 12 14 10 12 12 14 14
> 14 14 12 12 10 14 12 14
> 14 14 12 12 14 10 14 12
> 14 12 14 14 12 14 10 12
> 16 14 14 14 14 12 12 10
>
> The '16' seems to be the culprit. How does such a topo look like?

Any topology shaped like the chain below reproduces this issue:

+------+         +------+        +-------+       +------+
| node |         | node |        | node  |       | node |
|      +---------+      +--------+       +-------+      |
+------+         +------+        +-------+       +------+

For example, with the numa_distance table below, every CPU reports
"groups don't span domain->span":
node   0   1   2   3
  0:  10  12  20  22
  1:  12  10  22  24
  2:  20  22  10  12
  3:  22  24  12  10

QEMU:
qemu-system-aarch64 -M virt -nographic \
 -smp cpus=8 \
 -numa node,cpus=0-1,nodeid=0 \
 -numa node,cpus=2-3,nodeid=1 \
 -numa node,cpus=4-5,nodeid=2 \
 -numa node,cpus=6-7,nodeid=3 \
 -numa dist,src=0,dst=1,val=12 \
 -numa dist,src=0,dst=2,val=20 \
 -numa dist,src=0,dst=3,val=22 \
 -numa dist,src=1,dst=2,val=22 \
 -numa dist,src=2,dst=3,val=12 \
 -numa dist,src=1,dst=3,val=24
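
For anyone who wants to try other distance tables, the -numa options can be generated rather than typed by hand. This is only a rough Python sketch: the helper name qemu_numa_args and the cpus_per_node parameter are made up for illustration, and it assumes a symmetric table where giving each pair in one direction is enough, as in the command line here.

```python
# Hypothetical helper: turn a symmetric NUMA distance table into
# "-numa node,..." and "-numa dist,..." options for QEMU.
DIST = [
    [10, 12, 20, 22],
    [12, 10, 22, 24],
    [20, 22, 10, 12],
    [22, 24, 12, 10],
]


def qemu_numa_args(dist, cpus_per_node=2):
    """Return a list of -numa option strings for the given distance table."""
    n = len(dist)
    args = []
    # One "-numa node" option per node, with consecutive CPU numbers.
    for node in range(n):
        first = node * cpus_per_node
        last = first + cpus_per_node - 1
        cpus = str(first) if first == last else f"{first}-{last}"
        args.append(f"-numa node,cpus={cpus},nodeid={node}")
    # One "-numa dist" option per unordered pair (upper triangle only).
    for src in range(n):
        for dst in range(src + 1, n):
            if dist[src][dst] != dist[dst][src]:
                raise ValueError(f"asymmetric distance for nodes {src},{dst}")
            args.append(f"-numa dist,src={src},dst={dst},val={dist[src][dst]}")
    return args


if __name__ == "__main__":
    # Print the options one per line, ready to paste after "-smp cpus=8".
    print(" \\\n ".join(qemu_numa_args(DIST)))
```

The kernel image, root filesystem and other machine options still have to be supplied separately.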

Boot log:
[ 0.834496] CPU0 attaching sched-domain(s):
[ 0.834546] domain-0: span=0-1 level=MC
[ 0.834754] groups: 0:{ span=0 cap=1011 }, 1:{ span=1 cap=970 }
[ 0.835018] domain-1: span=0-3 level=NUMA
[ 0.835052] groups: 0:{ span=0-1 cap=1981 }, 2:{ span=2-3 cap=1997 }
[ 0.835128] domain-2: span=0-5 level=NUMA
[ 0.835144] groups: 0:{ span=0-3 cap=3978 }, 4:{ span=4-7 cap=3864 }
[ 0.835195] ERROR: groups don't span domain->span
[ 0.835206] domain-3: span=0-7 level=NUMA
[ 0.835222] groups: 0:{ span=0-5 mask=0-1 cap=5933 }, 6:{
span=4-7 mask=6-7 cap=3957 }
[ 0.835959] CPU1 attaching sched-domain(s):
[ 0.835974] domain-0: span=0-1 level=MC
[ 0.835996] groups: 1:{ span=1 cap=970 }, 0:{ span=0 cap=1011 }
[ 0.836049] domain-1: span=0-3 level=NUMA
[ 0.836065] groups: 0:{ span=0-1 cap=1981 }, 2:{ span=2-3 cap=1997 }
[ 0.836114] domain-2: span=0-5 level=NUMA
[ 0.836130] groups: 0:{ span=0-3 cap=3978 }, 4:{ span=4-7 cap=3864 }
[ 0.836178] ERROR: groups don't span domain->span
[ 0.836188] domain-3: span=0-7 level=NUMA
[ 0.836204] groups: 0:{ span=0-5 mask=0-1 cap=5933 }, 6:{
span=4-7 mask=6-7 cap=3957 }
[ 0.836290] CPU2 attaching sched-domain(s):
[ 0.836299] domain-0: span=2-3 level=MC
[ 0.836316] groups: 2:{ span=2 cap=983 }, 3:{ span=3 cap=1014 }
[ 0.836364] domain-1: span=0-3 level=NUMA
[ 0.836379] groups: 2:{ span=2-3 cap=1997 }, 0:{ span=0-1 cap=1981 }
[ 0.836427] domain-2: span=0-5 level=NUMA
[ 0.836442] groups: 2:{ span=0-3 mask=2-3 cap=4045 }, 4:{
span=0-1,4-7 mask=4-5 cap=5912 }
[ 0.836538] ERROR: groups don't span domain->span
[ 0.836549] domain-3: span=0-7 level=NUMA
[ 0.836580] groups: 2:{ span=0-5 mask=2-3 cap=6000 }, 6:{
span=0-1,4-7 mask=6-7 cap=6005 }
[ 0.836667] CPU3 attaching sched-domain(s):
[ 0.836675] domain-0: span=2-3 level=MC
[ 0.836690] groups: 3:{ span=3 cap=1014 }, 2:{ span=2 cap=983 }
[ 0.836734] domain-1: span=0-3 level=NUMA
[ 0.836749] groups: 2:{ span=2-3 cap=1997 }, 0:{ span=0-1 cap=1981 }
[ 0.836793] domain-2: span=0-5 level=NUMA
[ 0.836822] groups: 2:{ span=0-3 mask=2-3 cap=4045 }, 4:{
span=0-1,4-7 mask=4-5 cap=5912 }
[ 0.836879] ERROR: groups don't span domain->span
[ 0.836888] domain-3: span=0-7 level=NUMA
[ 0.836903] groups: 2:{ span=0-5 mask=2-3 cap=6000 }, 6:{
span=0-1,4-7 mask=6-7 cap=6005 }
[ 0.836975] CPU4 attaching sched-domain(s):
[ 0.836982] domain-0: span=4-5 level=MC
[ 0.836997] groups: 4:{ span=4 cap=945 }, 5:{ span=5 cap=1010 }
[ 0.837041] domain-1: span=4-7 level=NUMA
[ 0.837057] groups: 4:{ span=4-5 cap=1955 }, 6:{ span=6-7 cap=1909 }
[ 0.837102] domain-2: span=0-1,4-7 level=NUMA
[ 0.837117] groups: 4:{ span=4-7 cap=3864 }, 0:{ span=0-3 cap=3978 }
[ 0.837161] ERROR: groups don't span domain->span
[ 0.837170] domain-3: span=0-7 level=NUMA
[ 0.837185] groups: 4:{ span=0-1,4-7 mask=4-5 cap=5912 }, 2:{
span=0-3 mask=2-3 cap=4045 }
[ 0.837252] CPU5 attaching sched-domain(s):
[ 0.837260] domain-0: span=4-5 level=MC
[ 0.837275] groups: 5:{ span=5 cap=1010 }, 4:{ span=4 cap=945 }
[ 0.837320] domain-1: span=4-7 level=NUMA
[ 0.837334] groups: 4:{ span=4-5 cap=1955 }, 6:{ span=6-7 cap=1909 }
[ 0.837378] domain-2: span=0-1,4-7 level=NUMA
[ 0.837393] groups: 4:{ span=4-7 cap=3864 }, 0:{ span=0-3 cap=3978 }
[ 0.837437] ERROR: groups don't span domain->span
[ 0.837445] domain-3: span=0-7 level=NUMA
[ 0.837460] groups: 4:{ span=0-1,4-7 mask=4-5 cap=5912 }, 2:{
span=0-3 mask=2-3 cap=4045 }
[ 0.837552] CPU6 attaching sched-domain(s):
[ 0.837560] domain-0: span=6-7 level=MC
[ 0.837576] groups: 6:{ span=6 cap=1002 }, 7:{ span=7 cap=907 }
[ 0.837621] domain-1: span=4-7 level=NUMA
[ 0.837635] groups: 6:{ span=6-7 cap=1909 }, 4:{ span=4-5 cap=1955 }
[ 0.837679] domain-2: span=0-1,4-7 level=NUMA
[ 0.837695] groups: 6:{ span=4-7 mask=6-7 cap=3957 }, 0:{
span=0-5 mask=0-1 cap=5933 }
[ 0.837749] ERROR: groups don't span domain->span
[ 0.837758] domain-3: span=0-7 level=NUMA
[ 0.837774] groups: 6:{ span=0-1,4-7 mask=6-7 cap=6005 }, 2:{
span=0-5 mask=2-3 cap=6000 }
[ 0.838055] CPU7 attaching sched-domain(s):
[ 0.838066] domain-0: span=6-7 level=MC
[ 0.838086] groups: 7:{ span=7 cap=907 }, 6:{ span=6 cap=1002 }
[ 0.838135] domain-1: span=4-7 level=NUMA
[ 0.838151] groups: 6:{ span=6-7 cap=1909 }, 4:{ span=4-5 cap=1955 }
[ 0.838198] domain-2: span=0-1,4-7 level=NUMA
[ 0.838214] groups: 6:{ span=4-7 mask=6-7 cap=3957 }, 0:{
span=0-5 mask=0-1 cap=5933 }
[ 0.838272] ERROR: groups don't span domain->span
[ 0.838282] domain-3: span=0-7 level=NUMA
[ 0.838298] groups: 6:{ span=0-1,4-7 mask=6-7 cap=6005 }, 2:{
span=0-5 mask=2-3 cap=6000 }
[ 0.838414] root domain span: 0-7 (max cpu_capacity = 1024)
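
The ERROR lines above can be cross-checked with a small node-level model. The Python sketch below is a simplification under stated assumptions, not the kernel code: it works on nodes instead of CPUs, builds one span per node for each distinct distance, and forms each group from the next-lower-level span of the first node not yet covered, loosely modelled on what build_overlap_sched_groups() does. All function names are invented for illustration.

```python
# Simplified node-level model of overlapping NUMA sched-group construction.
# Assumption: each group at a level is the span one level below of the
# first not-yet-covered node, visited in wrap-around order.
DIST = [
    [10, 12, 20, 22],
    [12, 10, 22, 24],
    [20, 22, 10, 12],
    [22, 24, 12, 10],
]


def numa_masks(dist):
    """For each distinct distance, the set of nodes each node can reach."""
    levels = sorted({d for row in dist for d in row})
    n = len(dist)
    masks = [
        [{j for j in range(n) if dist[i][j] <= lvl} for i in range(n)]
        for lvl in levels
    ]
    return levels, masks


def group_spans(masks, level, node):
    """Return (domain span, list of group spans) for one node at one level."""
    n = len(masks[0])
    span = masks[level][node]
    covered = set()
    groups = []
    for step in range(n):
        i = (node + step) % n  # wrap-around walk starting at 'node'
        if i not in span or i in covered:
            continue
        group = masks[level - 1][i]  # child-level span of node i
        groups.append(group)
        covered |= group
    return span, groups


def find_mismatches(dist):
    """List (node, distance) pairs where the groups do not equal the span."""
    levels, masks = numa_masks(dist)
    bad = []
    for level in range(1, len(levels)):
        for node in range(len(dist)):
            span, groups = group_spans(masks, level, node)
            if set().union(*groups) != span:
                bad.append((node, levels[level]))
    return bad


if __name__ == "__main__":
    for node, d in find_mismatches(DIST):
        print(f"node {node}: groups don't span domain->span at distance {d}")
```

For the table above, this model flags nodes 0 and 2 at distance 20 and nodes 1 and 3 at distance 22, which lines up with the domain-2 errors in the log for CPU0-1 and CPU4-5, and for CPU2-3 and CPU6-7 respectively. In each case a group built from a neighbour's lower-level span pulls in a node that lies outside the domain's own span.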

Thanks
Barry