Re: [RESEND PATCH] sched/fair: Skip sched_balance_running cmpxchg when balance is not due

From: Shrikanth Hegde

Date: Mon Oct 13 2025 - 12:41:25 EST

On 10/13/25 10:02 PM, Chen, Yu C wrote:

On 10/13/2025 10:26 PM, Peter Zijlstra wrote:

On Thu, Oct 02, 2025 at 04:00:12PM -0700, Tim Chen wrote:

During load balancing, balancing at the LLC level and above must be
serialized.

I would argue with the wording here: there is no *must*, they *are*. Per the
current rules SD_NUMA and up get SD_SERIALIZE.

This is a *very* old thing, done by Christoph Lameter back when he was
at SGI. I'm not sure this default is still valid or not. Someone would
have to investigate. An alternative would be moving it into
node_reclaim_distance or somesuch.

Do you mean the following:

diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 444bdfdab731..436c899d8da2 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -1697,11 +1697,16 @@ sd_init(struct sched_domain_topology_level *tl,
                sd->cache_nice_tries = 2;

                sd->flags &= ~SD_PREFER_SIBLING;
-               sd->flags |= SD_SERIALIZE;
                if (sched_domains_numa_distance[tl->numa_level] > node_reclaim_distance) {
                        sd->flags &= ~(SD_BALANCE_EXEC |
                                       SD_BALANCE_FORK |
                                       SD_WAKE_AFFINE);
+                       /*
+                        * Nodes that are far away need to be serialized to
+                        * reduce the overhead of long-distance task migration
+                        * caused by load balancing.
+                        */
+                       sd->flags |= SD_SERIALIZE;
                }

We can run some tests to see whether removing SD_SERIALIZE has
any impact.

On a 2-socket Granite Rapids system with sub-NUMA clustering enabled
and running OLTP workloads, 7.6% of CPU cycles were spent on cmpxchg
operations for `sched_balance_running`. In most cases, the attempt
aborts immediately after acquisition because the load balance time is
not yet due.

So I'm not sure I understand the situation, @continue_balancing should
limit this concurrency to however many groups are on this domain -- your
granite thing with SNC on would have something like 6 groups?

My understanding is that continue_balancing is set to false after
atomic_cmpxchg(sched_balance_running), so if sched_balance_domains()
is invoked by many CPUs in parallel, the atomic operations still compete?

From what I can remember:

This almost always happens at SMT, after which continue_balancing = 0.
Since multiple CPUs end up calling it (especially on a busy system),
it causes a lot of cacheline bouncing and ends up showing in the perf profile.
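
To make the idea in the subject line concrete, here is a minimal userspace
sketch (C11 atomics, not the kernel code and not the patch itself) of
checking whether balancing is due before touching the shared flag. The names
balance_running, cmpxchg_attempts and try_serialized_balance() are
hypothetical stand-ins; only the ordering of the two checks is the point.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for the shared sched_balance_running flag. */
static atomic_int balance_running;

/* Counts how often the atomic was actually attempted (illustration only). */
static int cmpxchg_attempts;

/*
 * Returns true if this caller ran the serialized section.
 * 'now' and 'next_balance' are tick counts; the signed difference
 * handles wraparound, in the spirit of the kernel's time_after().
 */
static bool try_serialized_balance(unsigned long now, unsigned long next_balance)
{
	int expected = 0;

	/* Cheap local check first: if not due, never touch the shared cacheline. */
	if ((long)(now - next_balance) < 0)
		return false;

	cmpxchg_attempts++;
	if (!atomic_compare_exchange_strong_explicit(&balance_running, &expected, 1,
						     memory_order_acquire,
						     memory_order_relaxed))
		return false;	/* another CPU holds the serialization flag */

	/* ... the actual load balancing would run here ... */

	atomic_store_explicit(&balance_running, 0, memory_order_release);
	return true;
}
```

With this ordering, a caller whose balance interval has not expired returns
without issuing the cmpxchg at all, so it does not pull the flag's cacheline
in exclusive state.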