Re: [RFC patch v3 00/20] Cache aware scheduling

From: Tim Chen
Date: Tue Jun 24 2025 - 20:30:54 EST


On Tue, 2025-06-24 at 10:30 +0530, K Prateek Nayak wrote:
> Hello Tim,
>
> I have similar observations from my testing.
>
>
Prateek,

Thanks for the testing that you did. Much appreciated.
Some follow-up to Chen Yu's comments.

>
> o Benchmarks that prefer co-location and run in threaded mode see
> a benefit, including hackbench at high utilization and schbench
> at low utilization.
>
> o schbench (both new and old, but particularly the old) regresses
> quite a bit on the tail latency metric when #workers cross the
> LLC size.

Will take a closer look at the cases where #workers just exceed the
LLC size. Perhaps adjusting the threshold to spread the load earlier,
at a lower LLC utilization, will help.

>
> o client-server benchmarks where client and servers are threads
> from different processes (netserver-netperf, tbench_srv-tbench,
> services of DeathStarBench) seem to noticeably regress due to
> lack of co-location between the communicating client and server.
>
> Not sure if WF_SYNC can be an indicator to temporarily ignore
> the preferred LLC hint.

Currently we do not aggregate tasks from different processes.
The case where client and server actually reside on the same
system is, I think, the exception rather than the rule for real
workloads, where clients and servers reside on different systems.

But I do see tasks from different processes talking to each
other via pipe/socket in real workloads. Do you know of good
use cases for such a scenario that would justify extending task
aggregation to multiple processes?

>
> o stream regresses in some runs where the occupancy metrics trip
> and assign a preferred LLC for all the stream threads, bringing
> down performance in ~50% of the runs.
>

Yes, stream gets no cache benefit from co-locating threads, and
gets hurt by sharing common resources like the memory controller.


> Full data from my testing is as follows:
>
> o Machine details
>
> - 3rd Generation EPYC System
> - 2 sockets each with 64C/128T
> - NPS1 (Each socket is a NUMA node)
> - C2 Disabled (POLL and C1(MWAIT) remained enabled)
>
>
> ==================================================================
> Test : Various longer running benchmarks
> Units : %diff in throughput reported
> Interpretation: Higher is better
> Statistic : Median
> ==================================================================
> Benchmarks: %diff
> ycsb-cassandra -0.99%
> ycsb-mongodb -0.96%
> deathstarbench-1x -2.09%
> deathstarbench-2x -0.26%
> deathstarbench-3x -3.34%
> deathstarbench-6x -3.03%
> hammerdb+mysql 16VU -2.15%
> hammerdb+mysql 64VU -3.77%
>

The clients and servers of these benchmarks are co-located on the
same system, right?

> >
> > This patch set is applied on v6.15 kernel.
> >
> > There are some further work needed for future versions in this
> > patch set. We will need to align NUMA balancing with LLC aggregations
> > such that LLC aggregation will align with the preferred NUMA node.
> >
> > Comments and tests are much appreciated.
>
> I'll rerun the test once with the SCHED_FEAT() disabled just to make
> sure I'm not regressing because of some other factors. For the major
> regressions, I'll get the "perf sched stats" data to see if anything
> stands out.
>
> I'm also planning on getting the data from a Zen5c system with a larger
> LLC to see if there is any difference in the trend (I'll start with the
> microbenchmarks since setting up the larger ones will take some time)
>
> Sorry for the lack of engagement on previous versions but I plan on
> taking a better look at the series this time around. If you need any
> specific data from my setup, please do let me know.
>

Will do. Thanks.

Tim