Hello Jianyong,
On 5/29/2025 4:02 PM, Jianyong Wu wrote:
This will happen even when the two tasks are confined to a cpuset of 16 CPUs that share an LLC. I don't think the system is overloaded in this case.
But if they are located on 2 different CPUs, sched_balance_find_src_rq()
should not return any CPU, right? Probably just a timing thing with some
system noise that causes the CPU running the server / client to be
temporarily overloaded.
I've only seen
this happen when noise like a kworker comes in. What exactly is
causing these migrations in your case, and is it actually that bad
for iperf?
I think it's the nohz idle balance that pulls these two iperf tasks apart. But the root cause is that load balancing doesn't permit even a slight imbalance among LLCs.
Exactly. It's easy to reproduce on multi-LLC NUMA systems such as some AMD servers.
I observe a slight improvement with this patch in my tests.
Our solution: Permit controlled load imbalance between LLCs on the same
NUMA node, prioritizing communication affinity over strict balance.
Impact: In a virtual machine with one socket, multiple NUMA nodes (each
with 4 LLCs), unpatched systems suffered 3,000+ LLC migrations in 200
seconds as tasks cycled through all four LLCs. With the patch, migrations
stabilize at ≤10 instances, largely suppressing the NUMA-local LLC
thrashing.
Is there any improvement in iperf numbers with these changes?
I'll also give this series a spin on my end to see if it helps.
Signed-off-by: Jianyong Wu <wujianyong@xxxxxxxx>
---
kernel/sched/fair.c | 16 ++++++++++++++++
1 file changed, 16 insertions(+)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 0fb9bf995a47..749210e6316b 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -11203,6 +11203,22 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
}
#endif
+ /* Allow imbalance between LLCs within a single NUMA node */
+ if (env->sd->child && env->sd->child->flags & SD_SHARE_LLC && env->sd->parent
+ && env->sd->parent->flags & SD_NUMA) {
This does not imply multiple LLCs in a package. SD_SHARE_LLC is
SDF_SHARED_CHILD and will be set from the SMT domain onwards. This condition
will be true on Intel with SNC enabled despite not having multiple LLCs,
and llc_nr will be the number of cores there.
Perhaps multiple LLCs can be detected using:
!((sd->child->flags ^ sd->flags) & SD_SHARE_LLC)
This should have been just
(sd->child->flags ^ sd->flags) & SD_SHARE_LLC
to find the LLC boundary. Not sure why I prefixed that "!". You also
have to ensure sd itself is not a NUMA domain, which is possible on EPYC
platforms with the L3-as-NUMA option and on Intel with SNC.
Great! Thanks!
+ int child_weight = env->sd->child->span_weight;
+ int llc_nr = env->sd->span_weight / child_weight;
+ int imb_nr, min;
+
+ if (llc_nr > 1) {
+ /* Keep the imbalance no greater than half of child_weight */
+ min = child_weight >= 4 ? 2 : 1;
+ imb_nr = max_t(int, min, child_weight >> 2);
Isn't this just max_t(int, child_weight >> 2, 1)?
I expect imb_nr to be 2 when child_weight is 4, as I observe that the CPU count per LLC starts at 4 on multi-LLC NUMA systems.
However, this may overload the LLCs a bit. I'm not sure if it's a good idea.
My bad. I interpreted ">> 2" as "/ 2" here. A couple of brain-stopped-working
moments.