Re: [RFC patch v3 00/20] Cache aware scheduling

From: Tim Chen
Date: Mon Jun 23 2025 - 12:46:32 EST


On Sat, 2025-06-21 at 00:55 +0530, Madadi Vineeth Reddy wrote:
> Hi Tim,
>
> On 18/06/25 23:57, Tim Chen wrote:
> > This is the third revision of the cache aware scheduling patches,
> > based on the original patch proposed by Peter[1].
> >
> > The goal of the patch series is to aggregate tasks sharing data
> > to the same cache domain, thereby reducing cache bouncing and
> > cache misses, and improve data access efficiency. In the current
> > implementation, threads within the same process are considered
> > as entities that potentially share resources.
> >
> > In previous versions, aggregation of tasks was done in the
> > wake up path, without making load balancing paths aware of
> > LLC (Last-Level-Cache) preference. This led to the following
> > problems:
> >
> > 1) Aggregation of tasks during wake up led to load imbalance
> > between LLCs
> > 2) Load balancing tried to even out the load between LLCs
> > 3) Wake up tasks aggregation happened at a faster rate and
> > load balancing moved tasks in opposite directions, leading
> > to continuous and excessive task migrations and regressions
> > in benchmarks like schbench.
> >
> > In this version, load balancing is made cache-aware. The main
> > idea of cache-aware load balancing consists of two parts:
> >
> > 1) Identify tasks that prefer to run on their hottest LLC and
> > move them there.
> > 2) Prevent generic load balancing from moving a task out of
> > its hottest LLC.
> >
> > By default, LLC task aggregation during wake-up is disabled.
> > Conversely, cache-aware load balancing is enabled by default.
> > For easier comparison, two scheduler features are introduced:
> > SCHED_CACHE_WAKE and SCHED_CACHE_LB, which control cache-aware
> > wake up and cache-aware load balancing, respectively. By default,
> > NO_SCHED_CACHE_WAKE and SCHED_CACHE_LB are set, so task aggregation
> > is done only during load balancing.
>
> Tested this patch series on a Power11 system with 28 cores and 224 CPUs.
> LLC on this platform spans 4 threads.

Hi Madadi,

Thank you for testing this patch series.

If I understand correctly, the Power 11 you tested has 8 threads per core.
My suspicion is that, in this case, we benefit much more from spreading
the load across more cores than from aggregating it on fewer cores to
share more cache.
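
To make the core-count argument concrete, here is a rough sketch using the
28 cores / 224 CPUs you reported. The placement model and the function name
cores_used are purely illustrative assumptions, not what the scheduler or
this series actually does: "pack" fills all SMT siblings of a core before
using the next core, while "spread" puts one task on each core first.

```python
import math

def cores_used(tasks, cores, threads_per_core, pack):
    """Cores occupied by `tasks` runnable threads under a toy placement model.

    pack=True  : fill every SMT sibling of a core before using the next core.
    pack=False : place one task per core until all cores are busy.
    """
    if pack:
        return min(cores, math.ceil(tasks / threads_per_core))
    return min(cores, tasks)

# Figures from the report: 28 cores, 224 CPUs.
CORES, CPUS = 28, 224
threads_per_core = CPUS // CORES

for tasks in (4, 16, 64):
    packed = cores_used(tasks, CORES, threads_per_core, pack=True)
    spread = cores_used(tasks, CORES, threads_per_core, pack=False)
    print(f"{tasks:3d} tasks: packed on {packed} cores, spread over {spread} cores")
```

With many SMT siblings per core, the packed case leaves most cores idle at
low task counts, which is the situation I suspect we are hitting.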


>
> schbench:
>                           baseline (sd%)  baseline+cacheaware (sd%)   %change
> Lat 50.0th-worker-1        6.33 (24.12%)              6.00 (28.87%)     5.21%
> Lat 90.0th-worker-1        7.67 ( 7.53%)              7.67 (32.83%)     0.00%
> Lat 99.0th-worker-1        8.67 ( 6.66%)              9.33 (37.63%)    -7.61%
> Lat 99.9th-worker-1       21.33 (63.99%)             12.33 (28.47%)    42.19%
>
> Lat 50.0th-worker-2        4.33 (13.32%)              5.67 (10.19%)   -30.95%
> Lat 90.0th-worker-2        5.67 (20.38%)              7.67 ( 7.53%)   -35.27%
> Lat 99.0th-worker-2        7.33 ( 7.87%)              8.33 ( 6.93%)   -13.64%
> Lat 99.9th-worker-2       11.67 (24.74%)             10.33 (11.17%)    11.48%
>
> Lat 50.0th-worker-4        5.00 ( 0.00%)              7.00 ( 0.00%)   -40.00%
> Lat 90.0th-worker-4        7.00 ( 0.00%)              9.67 ( 5.97%)   -38.14%
> Lat 99.0th-worker-4        8.00 ( 0.00%)             11.33 (13.48%)   -41.62%
> Lat 99.9th-worker-4       10.33 ( 5.59%)             14.00 ( 7.14%)   -35.53%
>
> Lat 50.0th-worker-8        4.33 (13.32%)              5.67 (10.19%)   -30.95%
> Lat 90.0th-worker-8        6.33 (18.23%)              8.67 ( 6.66%)   -36.99%
> Lat 99.0th-worker-8        7.67 ( 7.53%)             10.33 ( 5.59%)   -34.69%
> Lat 99.9th-worker-8       10.00 (10.00%)             12.33 ( 4.68%)   -23.30%
>
> Lat 50.0th-worker-16       4.00 ( 0.00%)              5.00 ( 0.00%)   -25.00%
> Lat 90.0th-worker-16       6.33 ( 9.12%)              7.67 ( 7.53%)   -21.21%
> Lat 99.0th-worker-16       8.00 ( 0.00%)             10.33 ( 5.59%)   -29.13%
> Lat 99.9th-worker-16      12.00 ( 8.33%)             13.33 ( 4.33%)   -11.08%
>
> Lat 50.0th-worker-32       5.00 ( 0.00%)              5.33 (10.83%)    -6.60%
> Lat 90.0th-worker-32       7.00 ( 0.00%)              8.67 (17.63%)   -23.86%
> Lat 99.0th-worker-32      10.67 (14.32%)             12.67 ( 4.56%)   -18.75%
> Lat 99.9th-worker-32      14.67 ( 3.94%)             19.00 (13.93%)   -29.49%
>
> Lat 50.0th-worker-64       5.33 (10.83%)              6.67 ( 8.66%)   -25.14%
> Lat 90.0th-worker-64      10.00 (17.32%)             14.33 ( 4.03%)   -43.30%
> Lat 99.0th-worker-64      14.00 ( 7.14%)             16.67 ( 3.46%)   -19.07%
> Lat 99.9th-worker-64      55.00 (56.69%)             47.00 (61.92%)    14.55%
>
> Lat 50.0th-worker-128      8.00 ( 0.00%)              8.67 (13.32%)    -8.38%
> Lat 90.0th-worker-128     13.33 ( 4.33%)             14.33 ( 8.06%)    -7.50%
> Lat 99.0th-worker-128     16.00 ( 0.00%)             20.00 ( 8.66%)   -25.00%
> Lat 99.9th-worker-128   2258.33 (83.80%)           2974.67 (21.82%)   -31.72%
>
> Lat 50.0th-worker-256     47.67 ( 2.42%)             45.33 ( 3.37%)     4.91%
> Lat 90.0th-worker-256   3470.67 ( 1.88%)           3558.67 ( 0.47%)    -2.54%
> Lat 99.0th-worker-256   9040.00 ( 2.76%)           9050.67 ( 0.41%)    -0.12%
> Lat 99.9th-worker-256  13824.00 (20.07%)          13104.00 ( 6.84%)     5.21%
>
> The above data shows mostly regression both in the lesser and
> higher load cases.
>
>
> Hackbench pipe:
>
> Pairs  Baseline Avg (s) (Std%)  Patched Avg (s) (Std%)  % Change
>     2           2.987 ( 1.19%)          2.414 (17.99%)    24.06%
>     4           7.702 (12.53%)          7.228 (18.37%)     6.16%
>     8          14.141 ( 1.32%)         13.109 ( 1.46%)     7.29%
>    15          27.571 ( 6.53%)         29.460 ( 8.71%)    -6.84%
>    30          65.118 ( 4.49%)         61.352 ( 4.00%)     5.78%
>    45         105.086 ( 9.75%)         97.970 ( 4.26%)     6.77%
>    60         149.221 ( 6.91%)        154.176 ( 4.17%)    -3.32%
>    75         199.278 ( 1.21%)        198.680 ( 1.37%)     0.30%
>
> A lot of run-to-run variation is seen in the hackbench runs, so it is
> hard to judge the performance, but it looks better than schbench.
>
> In Power 10 and Power 11, the LLC is relatively small (4 CPUs)
> compared to platforms like Sapphire Rapids and Milan. I haven't gone
> through this series yet. I will go through it and try to understand why
> schbench is not happy on Power systems.

My guess is that, with 8 threads per core, LLC aggregation may have been
too aggressive in consolidating tasks on fewer cores and may have left
some CPU cycles unused. Running experiments with one thread per core on
Power11 may tell us whether this conjecture is true.
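
One way to set up such a run is to restrict the benchmark to a single CPU
per core. Below is a minimal sketch, assuming the standard sysfs topology
files (thread_siblings_list) are present on the test system. The helper
names are made up for illustration, and picking the lowest-numbered sibling
of each core is an arbitrary choice.

```python
import glob
import re

def parse_cpu_list(text):
    """Expand a sysfs CPU list such as "0-3,8" into a sorted list of ints."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        lo, _, hi = part.partition("-")
        cpus.update(range(int(lo), int(hi or lo) + 1))
    return sorted(cpus)

def one_cpu_per_core(siblings_by_cpu):
    """Return the lowest-numbered SMT sibling of every core.

    siblings_by_cpu maps a CPU number to the contents of its
    thread_siblings_list file.
    """
    firsts = set()
    for text in siblings_by_cpu.values():
        cpus = parse_cpu_list(text)
        if cpus:
            firsts.add(cpus[0])
    return sorted(firsts)

def read_sibling_lists():
    """Read thread_siblings_list for every CPU that exposes a topology dir."""
    pattern = "/sys/devices/system/cpu/cpu[0-9]*/topology/thread_siblings_list"
    result = {}
    for path in glob.glob(pattern):
        cpu = int(re.search(r"cpu(\d+)/topology", path).group(1))
        with open(path) as f:
            result[cpu] = f.read()
    return result

if __name__ == "__main__":
    # Prints a comma-separated CPU list, one entry per core.
    print(",".join(str(c) for c in one_cpu_per_core(read_sibling_lists())))
```

The printed list can be passed to "taskset -c" when launching schbench, so
that only one thread per core is used for the run.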

>
> Meanwhile, I wanted to know your thoughts on how a smaller LLC
> size is impacted by this patch.
>

This patch series is currently tuned for systems with single-threaded cores,
many cores per LLC, and a large cache per LLC.

With only 4 cores and 32 threads per LLC as in Power 11, we run out of cores
quickly and have more cache contention between the consolidated tasks.
We may have to set the aggregation threshold (sysctl_llc_aggr_cap) below
the default of 50% utilization, so that we consolidate less aggressively
and spread the tasks much sooner.
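
As a rough illustration of what lowering the threshold does, here is a toy
model of the check. The function may_aggregate and its arguments are
hypothetical, and the exact semantics of sysctl_llc_aggr_cap in the patches
may differ. It only assumes that aggregation into an LLC continues while
that LLC's utilization stays below the given percentage of its capacity.

```python
def may_aggregate(llc_util, llc_capacity, aggr_cap_pct=50):
    """Toy model: keep aggregating into an LLC only while its utilization
    is below aggr_cap_pct percent of its capacity.

    All names here are illustrative; aggr_cap_pct stands in for the
    sysctl_llc_aggr_cap tunable, whose default is 50.
    """
    return llc_util * 100 < llc_capacity * aggr_cap_pct

# An LLC at 30% utilization (capacity in arbitrary units).
util, capacity = 30, 100
for cap in (50, 25):
    verdict = "aggregate" if may_aggregate(util, capacity, cap) else "spread"
    print(f"cap={cap}%: {verdict}")
```

In this model the same LLC that would still attract tasks at the default
threshold is left alone at the lower one, so tasks spread out earlier.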


Tim