Re: [PATCH v2] sched/fair: Use sched_domain_span() for topology_span_sane()

From: Valentin Schneider
Date: Thu Jul 03 2025 - 12:01:47 EST


On 30/06/25 06:10, K Prateek Nayak wrote:
> Leon noted a topology_span_sane() warning in their guest deployment
> starting from v6.16-rc1 [1]. Debug that followed pointed to the
> tl->mask() for the NODE domain being incorrectly resolved to that of the
> highest NUMA domain.
>
> tl->mask() for NODE is set to the sd_numa_mask() which depends on the
> global "sched_domains_curr_level" hack. "sched_domains_curr_level" is
> set to the "tl->numa_level" during tl traversal in build_sched_domains()
> calling sd_init() but was not reset before topology_span_sane().
>
> Since "sched_domains_curr_level" still reflected the old value from
> build_sched_domains(), topology_span_sane() for the NODE domain trips
> when the span of the last NUMA domain overlaps.
>
> Instead of replicating the "sched_domains_curr_level" hack, Valentin
> suggested using the spans from the sched_domain objects constructed
> during build_sched_domains() which can also catch overlaps when the
> domain spans are fixed up by build_sched_domain().
>
> The original warning was reproducible on the following NUMA topology
> reported by Leon:
>
> $ sudo numactl -H
> available: 5 nodes (0-4)
> node 0 cpus: 0 1
> node 0 size: 2927 MB
> node 0 free: 1603 MB
> node 1 cpus: 2 3
> node 1 size: 3023 MB
> node 1 free: 3008 MB
> node 2 cpus: 4 5
> node 2 size: 3023 MB
> node 2 free: 3007 MB
> node 3 cpus: 6 7
> node 3 size: 3023 MB
> node 3 free: 3002 MB
> node 4 cpus: 8 9
> node 4 size: 3022 MB
> node 4 free: 2718 MB
> node distances:
> node   0   1   2   3   4
>   0:  10  39  38  37  36
>   1:  39  10  38  37  36
>   2:  38  38  10  37  36
>   3:  37  37  37  10  36
>   4:  36  36  36  36  10
>
> The above topology can be mimicked using the following QEMU cmd that was
> used to reproduce the warning and test the fix:
>
> sudo qemu-system-x86_64 -enable-kvm -cpu host \
> -m 20G -smp cpus=10,sockets=10 -machine q35 \
> -object memory-backend-ram,size=4G,id=m0 \
> -object memory-backend-ram,size=4G,id=m1 \
> -object memory-backend-ram,size=4G,id=m2 \
> -object memory-backend-ram,size=4G,id=m3 \
> -object memory-backend-ram,size=4G,id=m4 \
> -numa node,cpus=0-1,memdev=m0,nodeid=0 \
> -numa node,cpus=2-3,memdev=m1,nodeid=1 \
> -numa node,cpus=4-5,memdev=m2,nodeid=2 \
> -numa node,cpus=6-7,memdev=m3,nodeid=3 \
> -numa node,cpus=8-9,memdev=m4,nodeid=4 \
> -numa dist,src=0,dst=1,val=39 \
> -numa dist,src=0,dst=2,val=38 \
> -numa dist,src=0,dst=3,val=37 \
> -numa dist,src=0,dst=4,val=36 \
> -numa dist,src=1,dst=0,val=39 \
> -numa dist,src=1,dst=2,val=38 \
> -numa dist,src=1,dst=3,val=37 \
> -numa dist,src=1,dst=4,val=36 \
> -numa dist,src=2,dst=0,val=38 \
> -numa dist,src=2,dst=1,val=38 \
> -numa dist,src=2,dst=3,val=37 \
> -numa dist,src=2,dst=4,val=36 \
> -numa dist,src=3,dst=0,val=37 \
> -numa dist,src=3,dst=1,val=37 \
> -numa dist,src=3,dst=2,val=37 \
> -numa dist,src=3,dst=4,val=36 \
> -numa dist,src=4,dst=0,val=36 \
> -numa dist,src=4,dst=1,val=36 \
> -numa dist,src=4,dst=2,val=36 \
> -numa dist,src=4,dst=3,val=36 \
> ...
>
> Cc: Steve Wahl <steve.wahl@xxxxxxx>
> Suggested-by: Valentin Schneider <vschneid@xxxxxxxxxx>
> Reported-by: Leon Romanovsky <leon@xxxxxxxxxx>
> Closes: https://lore.kernel.org/lkml/20250610110701.GA256154@unreal/ [1]
> Fixes: ccf74128d66c ("sched/topology: Assert non-NUMA topology masks don't (partially) overlap") # ce29a7da84cd, f55dac1dafb3
> Signed-off-by: K Prateek Nayak <kprateek.nayak@xxxxxxx>
> ---
> v1..v2:
>
> o Use sched_domain_span() instead of replicating the
> "sched_domains_curr_level" hack (Valentin)
>
> o Included the QEMU cmd in the commit message for the record (Valentin)
>
> v1: https://lore.kernel.org/lkml/20250624041235.1589-1-kprateek.nayak@xxxxxxx/
>
> Changes are based on tip:sched/urgent at commit 914873bc7df9 ("Merge tag
> 'x86-build-2025-05-25' of
> git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip")

Thanks!

Tested-by: Valentin Schneider <vschneid@xxxxxxxxxx>
Reviewed-by: Valentin Schneider <vschneid@xxxxxxxxxx>