Re: [PATCH] sched/fair: allow imbalance between LLCs under NUMA

From: Jianyong Wu
Date: Sun Jun 15 2025 - 22:22:35 EST


Hi Prateek,

On 5/30/2025 2:09 PM, K Prateek Nayak wrote:
Hello Jianyong,

On 5/29/2025 4:02 PM, Jianyong Wu wrote:

This will happen even when 2 tasks are located in a cpuset of 16 CPUs that share an LLC. I don't think it's overloaded in this case.

But if they are located on 2 different CPUs, sched_balance_find_src_rq()
should not return any CPU right? Probably just a timing thing with some
system noise that causes the CPU running the server / client to be
temporarily overloaded.


  I've only seen
this happen when noise like a kworker comes in. What exactly is
causing these migrations in your case, and is it actually that bad
for iperf?

I think it's the nohz idle balance that pulls these 2 iperf tasks apart. But the root cause is that load balancing doesn't permit even a slight imbalance among LLCs.

Exactly. It's easy to reproduce on multi-LLC NUMA systems like some AMD servers.



Our solution: Permit controlled load imbalance between LLCs on the same
NUMA node, prioritizing communication affinity over strict balance.

Impact: In a virtual machine with one socket, multiple NUMA nodes (each
with 4 LLCs), unpatched systems suffered 3,000+ LLC migrations in 200
seconds as tasks cycled through all four LLCs. With the patch, migrations
stabilize at ≤10 instances, largely suppressing the NUMA-local LLC
thrashing.

Is there any improvement in iperf numbers with these changes?

I observe a bit of improvement with this patch in my test.

I'll also give this series a spin on my end to see if it helps.

Would you mind letting me know if you've had a chance to try it out, or if there's any update on the progress?

Thanks
Jianyong


Signed-off-by: Jianyong Wu <wujianyong@xxxxxxxx>
---
  kernel/sched/fair.c | 16 ++++++++++++++++
  1 file changed, 16 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 0fb9bf995a47..749210e6316b 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -11203,6 +11203,22 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
          }
  #endif
+        /* Allow imbalance between LLCs within a single NUMA node */
+        if (env->sd->child && env->sd->child->flags & SD_SHARE_LLC && env->sd->parent
+                && env->sd->parent->flags & SD_NUMA) {

This does not imply multiple LLCs in a package. SD_SHARE_LLC is
SDF_SHARED_CHILD and will be set from the SMT domain onwards. This
condition will be true on Intel with SNC enabled despite there not being
multiple LLCs, and llc_nr will be the number of cores there.

Perhaps multiple LLCs can be detected using:

     !((sd->child->flags ^ sd->flags) & SD_SHARE_LLC)

This should have been just

    (sd->child->flags ^ sd->flags) & SD_SHARE_LLC

to find the LLC boundary. Not sure why I prefixed that "!". You also
have to ensure sd itself is not a NUMA domain, which is possible on EPYC
platforms with the L3-as-NUMA option and on Intel with SNC.


Great! Thanks!
+            int child_weight = env->sd->child->span_weight;
+            int llc_nr = env->sd->span_weight / child_weight;
+            int imb_nr, min;
+
+            if (llc_nr > 1) {
+                /* Let the imbalance not be greater than half of child_weight */
+                min = child_weight >= 4 ? 2 : 1;
+                imb_nr = max_t(int, min, child_weight >> 2);

Isn't this just max_t(int, child_weight >> 2, 1)?

I expect imb_nr to be 2 when child_weight is 4, since I observe that LLC sizes start at 4 CPUs on multi-LLC NUMA systems.
However, this may leave the LLCs a bit overloaded; I'm not sure if it's a good idea.

My bad. I interpreted ">> 2" as "/ 2" here. A couple of brain-stopped-working moments.