Re: [PATCH] sched: Skip useless sched_balance_running acquisition if load balance is not due

From: K Prateek Nayak
Date: Fri Apr 18 2025 - 01:26:28 EST


Hello Peter,

On 4/17/2025 5:31 PM, Peter Zijlstra wrote:
>> o Since this is a single flag across the entire system, it also implies
>>   CPUs cannot concurrently do load balancing across different NUMA
>>   domains, which seems reasonable since a load balance at a lower NUMA
>>   domain can potentially change the "nr_numa_running" and
>>   "nr_preferred_running" stats for the higher domain. But if this is the
>>   case, a newidle balance at a lower NUMA domain can interfere with a
>>   concurrent busy / newidle load balancing at a higher NUMA domain.
>>   Is this expected? Should newidle balance be serialized too?

> Serializing new-idle might create too much idle time.

In the context of busy and idle balancing, what are your thoughts on a
per-sd "serialize" flag?
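
To make the per-sd idea concrete, here is a minimal userspace sketch using
C11 atomics. It is not kernel code: "struct sd_sketch", its "balancing"
field and both helpers are hypothetical names invented for illustration.
It only shows the shape of a try-acquire that fails when someone else is
already balancing the same domain, while leaving other domains free to
balance concurrently.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for a sched domain; only what the sketch needs. */
struct sd_sketch {
	bool serialize;        /* models SD_SERIALIZE being set for this domain */
	atomic_int balancing;  /* hypothetical per-domain "balance in progress" flag */
};

/* Returns true if the caller may balance this domain right now. */
static bool sd_balance_trylock(struct sd_sketch *sd)
{
	int expected = 0;

	if (!sd->serialize)
		return true;   /* domain is not serialized, nothing to acquire */

	/* Succeeds only if no one else currently holds this domain's flag. */
	return atomic_compare_exchange_strong_explicit(&sd->balancing,
						       &expected, 1,
						       memory_order_acquire,
						       memory_order_relaxed);
}

/* Must only be called after a successful sd_balance_trylock(). */
static void sd_balance_unlock(struct sd_sketch *sd)
{
	if (!sd->serialize)
		return;

	assert(atomic_load_explicit(&sd->balancing, memory_order_relaxed) == 1);
	atomic_store_explicit(&sd->balancing, 0, memory_order_release);
}
```

Since each CPU has its own copy of its sched domains, a real version
would need the flag to live in state shared by all CPUs spanning the
domain; the sketch glosses over that.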


>>   (P.S. I copied over the serialize logic from sched_balance_domains()
>>   into sched_balance_newidle() and did not see any difference in my
>>   testing but perhaps there are benchmarks out there that care for
>>   this)

>> o If the intention of SD_SERIALIZE was to actually "serializes
>>   load-balancing passes over large domains (above the NODE topology
>>   level)" as the comment above "sched_balance_running" states, and
>>   this question is specific to x86 - when enabling SNC on Intel or
>>   NPS on AMD servers, the first NUMA domain is in fact as big as the
>>   NODE (now PKG domain) if not smaller. Is it okay to clear
>>   SD_SERIALIZE for these domains since they are small enough now?

> You'll have to dive into the history here, but IIRC it was from SGI back
> in the day, where NUMA factors were quite large and load-balancing
> across NUMA was a pain.

Let me dig up the git history and see if any interesting details hide
there.


> Small really isn't the criterion, but inter-node latency might be, we
> also have this node_reclaim_distance thing.
>
> Not quite sure what makes sense, someone should tinker I suppose, see
> what works with today's hardware.

I'll try some experiments over the weekend to see if my machine turns
up happy with non-serialized lb for inter-PKG load balancing with NPS
turned on. I'll probably piggyback on the "node_reclaim_distance"
heuristics.
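
As a rough illustration of a distance-based heuristic, the userspace
sketch below keeps a domain serialized only if some pair of nodes in it
is farther apart than a threshold. Everything here is an assumption for
illustration: "domain_wants_serialize" is a hypothetical helper, the
distance matrix is passed in as a flat array rather than read from the
kernel's NUMA distance table, and the threshold is just a parameter
playing the role of node_reclaim_distance.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical helper: decide whether a NUMA domain should stay serialized.
 * dist is an nr_nodes x nr_nodes matrix in row-major order, holding the
 * distance between every pair of nodes spanned by the domain.
 */
static bool domain_wants_serialize(const int *dist, size_t nr_nodes,
				   int reclaim_distance)
{
	size_t i, j;

	assert(dist != NULL && nr_nodes > 0);

	for (i = 0; i < nr_nodes; i++) {
		for (j = 0; j < nr_nodes; j++) {
			/* One far-apart pair is enough to keep serializing. */
			if (dist[i * nr_nodes + j] > reclaim_distance)
				return true;
		}
	}

	/* All nodes are close to each other, serialization can be dropped. */
	return false;
}
```

With made-up distances, a two-node domain at distance 12 and a threshold
of 30 would not be serialized, while the same domain at distance 32
would be. Whether that matches what the hardware actually wants is what
the experiments would have to show.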

--
Thanks and Regards,
Prateek