Re: [RFC patch v3 00/20] Cache aware scheduling

From: K Prateek Nayak
Date: Wed Jun 25 2025 - 00:19:27 EST


Hello Chenyu,

On 6/24/2025 5:46 PM, Chen, Yu C wrote:

[..snip..]

tl;dr

o Benchmarks that prefer co-location and run in threaded mode see
   a benefit, including hackbench at high utilization and schbench
   at low utilization.


Previously, we tested hackbench with one group using different
fd pairs. The number of fds (1–6) was lower than the number
of CPUs (8) within one CCX. If I understand correctly, the
default number of fd pairs in hackbench is 20.

Yes, that is correct. I'm using the default configuration with
20 messengers and 20 receivers over 20 fd pairs. I'll check
if changing this to nr_llc and nr_llc / 2 makes a difference.

We might need
to handle cases where the number of threads (nr_thread)
exceeds the number of CPUs per LLC—perhaps by
skipping task aggregation in such scenarios.
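
A rough illustration of that check (purely a sketch, not patch code;
can_aggregate_to_llc() is a made-up name, while get_nr_threads() and
the per-CPU sd_llc_size are existing kernel helpers):

/*
 * Sketch only: skip preferred-LLC aggregation when the process has
 * more threads than the destination LLC has CPUs.
 */
static bool can_aggregate_to_llc(struct task_struct *p, int dst_cpu)
{
        int nr_threads = get_nr_threads(p);            /* threads in p's process */
        int llc_cpus = per_cpu(sd_llc_size, dst_cpu);  /* CPUs sharing dst_cpu's LLC */

        /*
         * Packing more runnable threads than the LLC has CPUs mostly
         * adds queueing delay, so fall back to regular placement.
         */
        return nr_threads <= llc_cpus;
}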

o schbench (both new and old, but particularly the old) regresses
   quite a bit on the tail latency metric when #workers crosses the
   LLC size.


As mentioned above, reconsidering nr_thread vs nr_cpus_per_llc
could mitigate the issue. Besides that, introducing a rate limit
for cache aware aggregation might help.
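
Something as simple as a jiffies based ratelimit could be a first cut,
along the lines of the sketch below (illustrative only; the
mm->llc_aggr_next_update field and the 100ms interval are made up):

#define LLC_AGGR_INTERVAL       msecs_to_jiffies(100)   /* assumed interval */

/*
 * Sketch of a per-mm ratelimit for cache aware aggregation.
 * llc_aggr_next_update is a hypothetical unsigned long field in
 * struct mm_struct.
 */
static bool llc_aggr_allowed(struct mm_struct *mm)
{
        unsigned long next = READ_ONCE(mm->llc_aggr_next_update);

        if (time_before(jiffies, next))
                return false;

        /* Only one updater wins the right to re-evaluate the preferred LLC. */
        return cmpxchg(&mm->llc_aggr_next_update, next,
                       jiffies + LLC_AGGR_INTERVAL) == next;
}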

o client-server benchmarks where client and servers are threads
   from different processes (netserver-netperf, tbench_srv-tbench,
   services of DeathStarBench) seem to noticeably regress due to
   lack of co-location between the communicating client and server.

   Not sure if WF_SYNC can be an indicator to temporarily ignore
   the preferred LLC hint.

WF_SYNC is used in the wakeup path, while the current v3 version does
the task aggregation in the load balance path. We'll look into this
client/server scenario.
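
For reference, the kind of WF_SYNC check being discussed would sit in
the wakeup path and could look roughly like the sketch below (hand-wavy;
mm_preferred_llc() is a made-up stand-in for however the series tracks
the preferred LLC hint):

/*
 * Sketch only: on a sync wakeup, bias the wakee towards the waker's
 * LLC instead of the process-wide preferred LLC, since the waker is
 * about to sleep and the pair is likely exchanging data.
 */
static int pick_wake_llc(struct task_struct *p, int waker_cpu, int wake_flags)
{
        if (wake_flags & WF_SYNC)
                return per_cpu(sd_llc_id, waker_cpu);

        return mm_preferred_llc(p->mm);         /* hypothetical helper */
}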


o stream regresses in some runs where the occupancy metrics trip
   and assign a preferred LLC for all the stream threads, bringing
   down performance in ~50% of the runs.


May I know if you tested stream with mmtests under OMP mode,
and what stream-10 and stream-100 mean?

I'm using STREAM in OMP mode. The "10" and "100" refer to the
"NTIMES" argument. I'm passing this when building the binary as:

gcc -DSTREAM_ARRAY_SIZE=$ARRAY_SIZE -DNTIMES=$NUM_TIMES -fopenmp -O2 stream.c -o stream

This repeats the main loop of the stream benchmark NTIMES times. 10
iterations are used to spot any imbalances over shorter runs of
bandwidth-intensive tasks, and 100 iterations are used to spot trends /
the ability to correct an incorrect placement over a longer run.
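
For context, NTIMES only controls how many times the timed kernels
repeat over the same arrays. Stripped down and from memory (not the
verbatim stream.c source), the relevant part looks like:

#ifndef STREAM_ARRAY_SIZE
#define STREAM_ARRAY_SIZE 10000000
#endif
#ifndef NTIMES
#define NTIMES 10
#endif

static double a[STREAM_ARRAY_SIZE], b[STREAM_ARRAY_SIZE], c[STREAM_ARRAY_SIZE];

void run_timed_kernels(void)
{
        for (int k = 0; k < NTIMES; k++) {
                /* Triad kernel; Copy, Scale and Add follow the same pattern. */
#pragma omp parallel for
                for (long j = 0; j < STREAM_ARRAY_SIZE; j++)
                        a[j] = b[j] + 3.0 * c[j];
        }
}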

Stream is an example
where all threads have their own private memory buffers and no
interaction with each other. For this benchmark, spreading
the threads across different nodes gets higher memory bandwidth because
stream allocates the buffers to be at least 4X the L3 cache size.
We lack a metric that can indicate when threads share a lot of
data (e.g., both Thread 1 and Thread 2 read from the same
buffer). In such cases, we should aggregate the threads;
otherwise, we should not aggregate them (as in the stream case).
On the other hand, stream-omp seems like an unrealistic
scenario: if the threads do not share a buffer, why create them
in the same process?

I'm not very sure why that is the case, but from what I know, HPC
heavily relies on OMP, and I believe using threads can reduce
the overhead of fork + join when the amount of parallelism in
OMP loops varies?
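
To illustrate the point with generic OpenMP behaviour (nothing specific
to this series): successive parallel regions in one process are
typically serviced by the runtime's existing worker pool, so varying
the loop parallelism does not pay the cost of spawning and joining OS
threads on every iteration. A minimal example, built with the same
-fopenmp flag as the stream binary above:

#include <omp.h>
#include <stdio.h>

int main(void)
{
        for (int iter = 0; iter < 3; iter++) {
                int width = 2 << iter;  /* requested parallelism varies per iteration */

                /* The OpenMP runtime typically reuses its worker pool here
                 * rather than creating fresh OS threads for each region. */
#pragma omp parallel num_threads(width)
                {
                        if (omp_get_thread_num() == 0)
                                printf("iter %d: %d threads\n",
                                       iter, omp_get_num_threads());
                }
        }
        return 0;
}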

[..snip..]



This patch set is applied on top of the v6.15 kernel.
There is some further work needed for future versions of this
patch set. We will need to align NUMA balancing with LLC aggregation
so that LLC aggregation aligns with the preferred NUMA node.

Comments and tests are much appreciated.

I'll rerun the test once with the SCHED_FEAT() disabled just to make
sure I'm not regressing because of some other factors. For the major
regressions, I'll get the "perf sched stats" data to see if anything
stands out.

It seems that task migrations and tasks bouncing between their preferred
LLC and non-preferred LLCs are one symptom that caused the regression.

That could be the case! I'll also include some migration data to see
if it reveals anything.

--
Thanks and Regards,
Prateek