Re: [RFC PATCH v2 0/2] sched/fair migration reduction features

From: K Prateek Nayak
Date: Thu Oct 26 2023 - 23:27:26 EST


Hello Mathieu,

On 10/19/2023 9:35 PM, Mathieu Desnoyers wrote:
> Hi,
>
> This series introduces two new scheduler features: UTIL_FITS_CAPACITY
> and SELECT_BIAS_PREV. When used together, they achieve a 41% speedup of
> a hackbench workload which leaves some idle CPU time on a 192-core AMD
> EPYC.
>
> The main metrics which are significantly improved are:
>
> - cpu-migrations are reduced by 80%,
> - CPU utilization is increased by 17%.
>
> Feedback is welcome. I am especially interested to learn whether this
> series has positive or detrimental effects on performance of other
> workloads.

I got a chance to test this series on a dual socket 3rd Generation EPYC
System (2 x 64C/128T). Following is a quick summary:

- stream and ycsb-mongodb don't see any changes.

- hackbench and DeathStarBench see a major improvement. Both are high
utilization workloads with CPUs being overloaded most of the time.
DeathStarBench is known to benefit from lower migration count. It was
discussed by Gautham at OSPM '23.

- tbench, netperf, and schbench regress. The former two regress when
the system is near fully loaded, and the latter for most cases. All
these benchmarks are client-server / messenger-worker oriented and are
known to perform better when the client-server / messenger-worker pairs
are on the same CCX (LLC domain).

Detailed results are as follows:

o Machine details

- 3rd Generation EPYC System
- 2 sockets each with 64C/128T
- NPS1 (Each socket is a NUMA node)
- C2 Disabled (POLL and C1(MWAIT) remained enabled)

o Kernel Details

- tip: tip:sched/core at commit 984ffb6a4366 ("sched/fair: Remove
SIS_PROP")

- wake_prev_bias: tip + this series + Peter's suggestion to optimize
sched_util_fits_capacity_active()

I've taken the liberty of resolving the conflict with the recently added
cluster wakeup optimization by prioritizing the "SELECT_BIAS_PREV"
feature. select_idle_sibling() now looks as follows:

select_idle_sibling(...)
{
	...

	/*
	 * With the SELECT_BIAS_PREV feature, if the previous CPU is
	 * cache affine, prefer the previous CPU when all CPUs are busy
	 * to inhibit migration.
	 */
	if (sched_feat(SELECT_BIAS_PREV) &&
	    prev != target && cpus_share_cache(prev, target))
		return prev;

	/*
	 * For cluster machines which have lower sharing cache like L2 or
	 * LLC Tag, we tend to find an idle CPU in the target's cluster
	 * first. But prev_cpu or recent_used_cpu may also be a good candidate,
	 * use them if possible when no idle CPU found in select_idle_cpu().
	 */
	if ((unsigned int)prev_aff < nr_cpumask_bits)
		return prev_aff;
	if ((unsigned int)recent_used_cpu < nr_cpumask_bits)
		return recent_used_cpu;

	return target;
}

Please let me know if you have a different ordering in mind.

o Benchmark results

==================================================================
Test : hackbench
Units : Normalized time in seconds
Interpretation: Lower is better
Statistic : AMean
==================================================================
Case: tip[pct imp](CV) wake_prev_bias[pct imp](CV)
1-groups 1.00 [ -0.00]( 2.88) 0.97 [ 2.88]( 1.78)
2-groups 1.00 [ -0.00]( 2.03) 0.91 [ 8.79]( 1.19)
4-groups 1.00 [ -0.00]( 1.42) 0.87 [ 13.07]( 1.77)
8-groups 1.00 [ -0.00]( 1.37) 0.86 [ 13.70]( 0.98)
16-groups 1.00 [ -0.00]( 2.54) 0.90 [ 9.74]( 1.65)


==================================================================
Test : tbench
Units : Normalized throughput
Interpretation: Higher is better
Statistic : AMean
==================================================================
Clients: tip[pct imp](CV) wake_prev_bias[pct imp](CV)
1 1.00 [ 0.00]( 0.63) 0.99 [ -0.53]( 0.97)
2 1.00 [ 0.00]( 0.89) 1.00 [ 0.21]( 0.99)
4 1.00 [ 0.00]( 1.34) 1.01 [ 0.70]( 0.88)
8 1.00 [ 0.00]( 0.49) 1.00 [ 0.40]( 0.55)
16 1.00 [ 0.00]( 1.51) 0.99 [ -0.51]( 1.23)
32 1.00 [ 0.00]( 0.74) 0.97 [ -2.57]( 0.59)
64 1.00 [ 0.00]( 0.92) 0.95 [ -4.69]( 0.70)
128 1.00 [ 0.00]( 0.97) 0.91 [ -8.58]( 0.29)
256 1.00 [ 0.00]( 1.14) 0.90 [ -9.86]( 2.40)
512 1.00 [ 0.00]( 0.35) 0.97 [ -2.91]( 1.78)
1024 1.00 [ 0.00]( 0.07) 0.96 [ -4.15]( 1.43)


==================================================================
Test : stream-10
Units : Normalized Bandwidth, MB/s
Interpretation: Higher is better
Statistic : HMean
==================================================================
Test: tip[pct imp](CV) wake_prev_bias[pct imp](CV)
Copy 1.00 [ 0.00]( 8.25) 1.04 [ 3.53](10.84)
Scale 1.00 [ 0.00]( 5.65) 0.99 [ -0.85]( 5.94)
Add 1.00 [ 0.00]( 5.73) 1.00 [ 0.50]( 7.68)
Triad 1.00 [ 0.00]( 3.41) 1.00 [ 0.12]( 6.25)


==================================================================
Test : stream-100
Units : Normalized Bandwidth, MB/s
Interpretation: Higher is better
Statistic : HMean
==================================================================
Test: tip[pct imp](CV) wake_prev_bias[pct imp](CV)
Copy 1.00 [ 0.00]( 1.75) 1.01 [ 1.18]( 1.61)
Scale 1.00 [ 0.00]( 0.92) 1.00 [ -0.14]( 1.37)
Add 1.00 [ 0.00]( 0.32) 0.99 [ -0.54]( 1.34)
Triad 1.00 [ 0.00]( 5.97) 1.00 [ 0.37]( 6.34)


==================================================================
Test : netperf
Units : Normalized Throughput
Interpretation: Higher is better
Statistic : AMean
==================================================================
Clients: tip[pct imp](CV) wake_prev_bias[pct imp](CV)
1-clients 1.00 [ 0.00]( 0.67) 1.00 [ 0.08]( 0.15)
2-clients 1.00 [ 0.00]( 0.15) 1.00 [ 0.10]( 0.57)
4-clients 1.00 [ 0.00]( 0.58) 1.00 [ 0.10]( 0.74)
8-clients 1.00 [ 0.00]( 0.46) 1.00 [ 0.31]( 0.64)
16-clients 1.00 [ 0.00]( 0.84) 0.99 [ -0.56]( 1.78)
32-clients 1.00 [ 0.00]( 1.07) 1.00 [ 0.04]( 1.40)
64-clients 1.00 [ 0.00]( 1.53) 1.01 [ 0.68]( 2.27)
128-clients 1.00 [ 0.00]( 1.17) 0.99 [ -0.70]( 1.17)
256-clients 1.00 [ 0.00]( 5.42) 0.91 [ -9.31](10.74)
512-clients 1.00 [ 0.00](48.07) 1.00 [ -0.07](47.71)


==================================================================
Test : schbench
Units : Normalized 99th percentile latency in us
Interpretation: Lower is better
Statistic : Median
==================================================================
#workers: tip[pct imp](CV) wake_prev_bias[pct imp](CV)
1 1.00 [ -0.00](12.00) 1.06 [ -5.56]( 2.99)
2 1.00 [ -0.00]( 6.96) 1.08 [ -7.69]( 2.38)
4 1.00 [ -0.00](13.57) 1.07 [ -7.32](12.95)
8 1.00 [ -0.00]( 6.45) 0.98 [ 2.08](10.86)
16 1.00 [ -0.00]( 3.45) 1.02 [ -1.72]( 1.69)
32 1.00 [ -0.00]( 3.00) 1.05 [ -5.00](10.92)
64 1.00 [ -0.00]( 2.18) 1.04 [ -4.17]( 1.15)
128 1.00 [ -0.00]( 7.15) 1.07 [ -6.65]( 8.45)
256 1.00 [ -0.00](30.20) 1.72 [-72.03](30.62)
512 1.00 [ -0.00]( 4.90) 0.97 [ 3.25]( 1.92)


==================================================================
Test : ycsb-mongodb
Units : Normalized throughput
Interpretation: Higher is better
Statistic : Mean
==================================================================
metric tip wake_prev_bias(%diff)
throughput 1.00 0.99 (%diff: -0.94%)


==================================================================
Test : DeathStarBench
Units : Normalized throughput
Interpretation: Higher is better
Statistic : Mean
==================================================================
Pinning scaling tip wake_prev_bias(%diff)
1CCD 1 1.00 1.10 (%diff: 10.04%)
2CCD 2 1.00 1.06 (%diff: 5.90%)
4CCD 4 1.00 1.04 (%diff: 3.74%)
8CCD 8 1.00 1.03 (%diff: 2.98%)

--
It is a mixed bag of results, as expected. I would love to hear your
thoughts on them. Meanwhile, I'll try to gather more data from other
benchmarks.

>
> [..snip..]
>

--
Thanks and Regards,
Prateek