Re: [RFC patch v3 00/20] Cache aware scheduling

From: Chen, Yu C
Date: Sat Jun 21 2025 - 20:40:21 EST

Next message: Sebastian Reichel: "Re: [PATCH v7 5/6] power: supply: pf1550: add battery charger support"
Previous message: Daniel Sullivan: "[PATCH] gpio: ts5500: use new GPIO line value setter callbacks"
In reply to: Madadi Vineeth Reddy: "Re: [RFC patch v3 00/20] Cache aware scheduling"
Next in thread: Tim Chen: "Re: [RFC patch v3 00/20] Cache aware scheduling"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

On 6/21/2025 3:25 AM, Madadi Vineeth Reddy wrote:

Hi Tim,

On 18/06/25 23:57, Tim Chen wrote:

This is the third revision of the cache aware scheduling patches,
based on the original patch proposed by Peter[1].
The goal of the patch series is to aggregate tasks sharing data
to the same cache domain, thereby reducing cache bouncing and
cache misses, and improve data access efficiency. In the current
implementation, threads within the same process are considered
as entities that potentially share resources.
In previous versions, aggregation of tasks were done in the
wake up path, without making load balancing paths aware of
LLC (Last-Level-Cache) preference. This led to the following
problems:

1) Aggregation of tasks during wake up led to load imbalance
between LLCs
2) Load balancing tried to even out the load between LLCs
3) Wake up tasks aggregation happened at a faster rate and
load balancing moved tasks in opposite directions, leading
to continuous and excessive task migrations and regressions
in benchmarks like schbench.

In this version, load balancing is made cache-aware. The main
idea of cache-aware load balancing consists of two parts:

1) Identify tasks that prefer to run on their hottest LLC and
move them there.
2) Prevent generic load balancing from moving a task out of
its hottest LLC.

By default, LLC task aggregation during wake-up is disabled.
Conversely, cache-aware load balancing is enabled by default.
For easier comparison, two scheduler features are introduced:
SCHED_CACHE_WAKE and SCHED_CACHE_LB, which control cache-aware
wake up and cache-aware load balancing, respectively. By default,
NO_SCHED_CACHE_WAKE and SCHED_CACHE_LB are set, so tasks aggregation
is only done on load balancing.

Tested this patch series on a Power11 system with 28 cores and 224 CPUs.
LLC on this platform spans 4 threads.

schbench:
baseline (sd%) baseline+cacheaware (sd%) %change
Lat 50.0th-worker-1 6.33 (24.12%) 6.00 (28.87%) 5.21%
Lat 90.0th-worker-1 7.67 ( 7.53%) 7.67 (32.83%) 0.00%
Lat 99.0th-worker-1 8.67 ( 6.66%) 9.33 (37.63%) -7.61%
Lat 99.9th-worker-1 21.33 (63.99%) 12.33 (28.47%) 42.19%

Lat 50.0th-worker-2 4.33 (13.32%) 5.67 (10.19%) -30.95%
Lat 90.0th-worker-2 5.67 (20.38%) 7.67 ( 7.53%) -35.27%
Lat 99.0th-worker-2 7.33 ( 7.87%) 8.33 ( 6.93%) -13.64%
Lat 99.9th-worker-2 11.67 (24.74%) 10.33 (11.17%) 11.48%

Lat 50.0th-worker-4 5.00 ( 0.00%) 7.00 ( 0.00%) -40.00%
Lat 90.0th-worker-4 7.00 ( 0.00%) 9.67 ( 5.97%) -38.14%
Lat 99.0th-worker-4 8.00 ( 0.00%) 11.33 (13.48%) -41.62%
Lat 99.9th-worker-4 10.33 ( 5.59%) 14.00 ( 7.14%) -35.53%

Lat 50.0th-worker-8 4.33 (13.32%) 5.67 (10.19%) -30.95%
Lat 90.0th-worker-8 6.33 (18.23%) 8.67 ( 6.66%) -36.99%
Lat 99.0th-worker-8 7.67 ( 7.53%) 10.33 ( 5.59%) -34.69%
Lat 99.9th-worker-8 10.00 (10.00%) 12.33 ( 4.68%) -23.30%

Lat 50.0th-worker-16 4.00 ( 0.00%) 5.00 ( 0.00%) -25.00%
Lat 90.0th-worker-16 6.33 ( 9.12%) 7.67 ( 7.53%) -21.21%
Lat 99.0th-worker-16 8.00 ( 0.00%) 10.33 ( 5.59%) -29.13%
Lat 99.9th-worker-16 12.00 ( 8.33%) 13.33 ( 4.33%) -11.08%

Lat 50.0th-worker-32 5.00 ( 0.00%) 5.33 (10.83%) -6.60%
Lat 90.0th-worker-32 7.00 ( 0.00%) 8.67 (17.63%) -23.86%
Lat 99.0th-worker-32 10.67 (14.32%) 12.67 ( 4.56%) -18.75%
Lat 99.9th-worker-32 14.67 ( 3.94%) 19.00 (13.93%) -29.49%

Lat 50.0th-worker-64 5.33 (10.83%) 6.67 ( 8.66%) -25.14%
Lat 90.0th-worker-64 10.00 (17.32%) 14.33 ( 4.03%) -43.30%
Lat 99.0th-worker-64 14.00 ( 7.14%) 16.67 ( 3.46%) -19.07%
Lat 99.9th-worker-64 55.00 (56.69%) 47.00 (61.92%) 14.55%

Lat 50.0th-worker-128 8.00 ( 0.00%) 8.67 (13.32%) -8.38%
Lat 90.0th-worker-128 13.33 ( 4.33%) 14.33 ( 8.06%) -7.50%
Lat 99.0th-worker-128 16.00 ( 0.00%) 20.00 ( 8.66%) -25.00%
Lat 99.9th-worker-128 2258.33 (83.80%) 2974.67 (21.82%) -31.72%

Lat 50.0th-worker-256 47.67 ( 2.42%) 45.33 ( 3.37%) 4.91%
Lat 90.0th-worker-256 3470.67 ( 1.88%) 3558.67 ( 0.47%) -2.54%
Lat 99.0th-worker-256 9040.00 ( 2.76%) 9050.67 ( 0.41%) -0.12%
Lat 99.9th-worker-256 13824.00 (20.07%) 13104.00 ( 6.84%) 5.21%

The above data shows mostly regression both in the lesser and
higher load cases.

Hackbench pipe:

Pairs Baseline Avg (s) (Std%) Patched Avg (s) (Std%) % Change
2 2.987 (1.19%) 2.414 (17.99%) 24.06%
4 7.702 (12.53%) 7.228 (18.37%) 6.16%
8 14.141 (1.32%) 13.109 (1.46%) 7.29%
15 27.571 (6.53%) 29.460 (8.71%) -6.84%
30 65.118 (4.49%) 61.352 (4.00%) 5.78%
45 105.086 (9.75%) 97.970 (4.26%) 6.77%
60 149.221 (6.91%) 154.176 (4.17%) -3.32%
75 199.278 (1.21%) 198.680 (1.37%) 0.30%

A lot of run to run variation is seen in hackbench runs. So hard to tell
on the performance but looks better than schbench.

May I know if the cpu frequency was set at a fixed level and deep
cpu idle states were disabled(I assume on power system it is called
stop states?)

In Power 10 and Power 11, The LLC size is relatively smaller (4 CPUs)
when compared to platforms like sapphire rapids and Milan. Didn't go
through this series yet. Will go through and try to understand why
schbench is not happy on Power systems.

Meanwhile, Wanted to know your thoughts on how does smaller LLC
size get impacted with this patch?

task aggregation on smaller LLC domain(both in terms of the
number of CPUs and the size of LLC) might bring cache contention
and hurt performance IMO. May I know what is the cache size on
your system:
lscpu | grep "L3 cache"

May I know if you tested it with:
echo NO_SCHED_CACHE > /sys/kernel/debug/sched/features
echo NO_SCHED_CACHE_WAKE > /sys/kernel/debug/sched/features
echo NO_SCHED_CACHE_LB > /sys/kernel/debug/sched/features

vs

echo SCHED_CACHE > /sys/kernel/debug/sched/features
echo NO_SCHED_CACHE_WAKE > /sys/kernel/debug/sched/features
echo SCHED_CACHE_LB > /sys/kernel/debug/sched/features

And could you help check if setting /sys/kernel/debug/sched/llc_aggr_cap
from 50 to some smaller values(25, etc) would help?

thanks,
Chenyu

Thanks,
Madadi Vineeth Reddy

With above default settings, task migrations occur less frequently
and no longer happen in the latency-sensitive wake-up path.

[..snip..]

Chen Yu (3):
sched: Several fixes for cache aware scheduling
sched: Avoid task migration within its preferred LLC
sched: Save the per LLC utilization for better cache aware scheduling

K Prateek Nayak (1):
sched: Avoid calculating the cpumask if the system is overloaded

Peter Zijlstra (1):
sched: Cache aware load-balancing

Tim Chen (15):
sched: Add hysteresis to switch a task's preferred LLC
sched: Add helper function to decide whether to allow cache aware
scheduling
sched: Set up LLC indexing
sched: Introduce task preferred LLC field
sched: Calculate the number of tasks that have LLC preference on a
runqueue
sched: Introduce per runqueue task LLC preference counter
sched: Calculate the total number of preferred LLC tasks during load
balance
sched: Tag the sched group as llc_balance if it has tasks prefer other
LLC
sched: Introduce update_llc_busiest() to deal with groups having
preferred LLC tasks
sched: Introduce a new migration_type to track the preferred LLC load
balance
sched: Consider LLC locality for active balance
sched: Consider LLC preference when picking tasks from busiest queue
sched: Do not migrate task if it is moving out of its preferred LLC
sched: Introduce SCHED_CACHE_LB to control cache aware load balance
sched: Introduce SCHED_CACHE_WAKE to control LLC aggregation on wake
up

include/linux/mm_types.h | 44 ++
include/linux/sched.h | 8 +
include/linux/sched/topology.h | 3 +
init/Kconfig | 4 +
init/init_task.c | 3 +
kernel/fork.c | 5 +
kernel/sched/core.c | 25 +-
kernel/sched/debug.c | 4 +
kernel/sched/fair.c | 859 ++++++++++++++++++++++++++++++++-
kernel/sched/features.h | 3 +
kernel/sched/sched.h | 23 +
kernel/sched/topology.c | 29 ++
12 files changed, 982 insertions(+), 28 deletions(-)

Next message: Sebastian Reichel: "Re: [PATCH v7 5/6] power: supply: pf1550: add battery charger support"
Previous message: Daniel Sullivan: "[PATCH] gpio: ts5500: use new GPIO line value setter callbacks"
In reply to: Madadi Vineeth Reddy: "Re: [RFC patch v3 00/20] Cache aware scheduling"
Next in thread: Tim Chen: "Re: [RFC patch v3 00/20] Cache aware scheduling"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]