Re: [PATCH] sched/fair: allow imbalance between LLCs under NUMA

From: Jianyong Wu
Date: Thu Jun 19 2025 - 02:09:06 EST


Hi Prateek,

Thank you for taking the time to test this patch.

This patch aims to reduce pointless task migrations, such as those seen in iperf tests; performance was not the primary consideration. In my own iperf tests, I did not observe a significant performance improvement either, although the number of task migrations dropped substantially. Even when I bound the iperf tasks to the same LLC, the performance metrics did not improve significantly. So this change is unlikely to improve iperf performance noticeably, which suggests that task migration has little effect on iperf.

IMO, we should allow at least two tasks per LLC so that communicating tasks can stay together. In theory this could yield better performance, even though I haven't found a workload that demonstrates it yet.

If this change has a negative effect on performance, do you have any suggestions for mitigating the iperf migration issue? Or should we just leave it as is?

Any suggestions would be greatly appreciated.

Thanks
Jianyong

On 6/18/2025 2:37 PM, K Prateek Nayak wrote:
Hello Jianyong,

On 6/16/2025 7:52 AM, Jianyong Wu wrote:
Would you mind letting me know if you've had a chance to try it out, or if there's any update on the progress?

Here are my results from a dual socket 3rd Generation EPYC
system.

tl;dr I don't see any improvement, and there are a few regressions too,
but some of those data points also have a lot of variance.

o Machine details

- 3rd Generation EPYC System
- 2 sockets each with 64C/128T
- NPS1 (Each socket is a NUMA node)
- C2 Disabled (POLL and C1(MWAIT) remained enabled)

o Kernel details

tip:       tip:sched/core at commit 914873bc7df9 ("Merge tag
           'x86-build-2025-05-25' of
           git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip")

allow_imb: tip + this series as is

o Benchmark results

    ==================================================================
    Test          : hackbench
    Units         : Normalized time in seconds
    Interpretation: Lower is better
    Statistic     : AMean
    ==================================================================
    Case:           tip[pct imp](CV)     allow_imb[pct imp](CV)
     1-groups     1.00 [ -0.00](13.74)     1.03 [ -3.20]( 9.18)
     2-groups     1.00 [ -0.00]( 9.58)     1.06 [ -6.46]( 7.63)
     4-groups     1.00 [ -0.00]( 2.10)     1.01 [ -1.30]( 1.90)
     8-groups     1.00 [ -0.00]( 1.51)     0.99 [  1.42]( 0.91)
    16-groups     1.00 [ -0.00]( 1.10)     0.99 [  1.09]( 1.13)


    ==================================================================
    Test          : tbench
    Units         : Normalized throughput
    Interpretation: Higher is better
    Statistic     : AMean
    ==================================================================
    Clients:           tip[pct imp](CV)     allow_imb[pct imp](CV)
        1     1.00 [  0.00]( 0.82)     1.01 [  1.11]( 0.27)
        2     1.00 [  0.00]( 1.13)     1.00 [ -0.05]( 0.62)
        4     1.00 [  0.00]( 1.12)     1.02 [  2.36]( 0.19)
        8     1.00 [  0.00]( 0.93)     1.01 [  1.02]( 0.86)
       16     1.00 [  0.00]( 0.38)     1.01 [  0.71]( 1.71)
       32     1.00 [  0.00]( 0.66)     1.01 [  1.31]( 1.88)
       64     1.00 [  0.00]( 1.18)     0.98 [ -1.60]( 2.90)
      128     1.00 [  0.00]( 1.12)     1.02 [  1.60]( 0.42)
      256     1.00 [  0.00]( 0.42)     1.00 [  0.40]( 0.80)
      512     1.00 [  0.00]( 0.14)     1.01 [  0.97]( 0.25)
     1024     1.00 [  0.00]( 0.26)     1.01 [  1.29]( 0.19)


    ==================================================================
    Test          : stream-10
    Units         : Normalized Bandwidth, MB/s
    Interpretation: Higher is better
    Statistic     : HMean
    ==================================================================
    Test:           tip[pct imp](CV)     allow_imb[pct imp](CV)
     Copy     1.00 [  0.00]( 8.37)     1.01 [  1.00]( 5.71)
    Scale     1.00 [  0.00]( 2.85)     0.98 [ -1.94]( 5.23)
      Add     1.00 [  0.00]( 3.39)     0.99 [ -1.39]( 4.77)
    Triad     1.00 [  0.00]( 6.39)     1.05 [  5.15]( 5.62)


    ==================================================================
    Test          : stream-100
    Units         : Normalized Bandwidth, MB/s
    Interpretation: Higher is better
    Statistic     : HMean
    ==================================================================
    Test:           tip[pct imp](CV)     allow_imb[pct imp](CV)
     Copy     1.00 [  0.00]( 3.91)     1.01 [  1.28]( 2.01)
    Scale     1.00 [  0.00]( 4.34)     0.99 [ -0.65]( 3.74)
      Add     1.00 [  0.00]( 4.14)     1.01 [  0.54]( 1.63)
    Triad     1.00 [  0.00]( 1.00)     0.98 [ -2.28]( 4.89)


    ==================================================================
    Test          : netperf
    Units         : Normalized Throughput
    Interpretation: Higher is better
    Statistic     : AMean
    ==================================================================
    Clients:           tip[pct imp](CV)     allow_imb[pct imp](CV)
     1-clients     1.00 [  0.00]( 0.41)     1.01 [  1.17]( 0.39)
     2-clients     1.00 [  0.00]( 0.58)     1.01 [  1.00]( 0.40)
     4-clients     1.00 [  0.00]( 0.35)     1.01 [  0.73]( 0.50)
     8-clients     1.00 [  0.00]( 0.48)     1.00 [  0.42]( 0.67)
    16-clients     1.00 [  0.00]( 0.66)     1.01 [  0.84]( 0.57)
    32-clients     1.00 [  0.00]( 1.15)     1.01 [  0.82]( 0.96)
    64-clients     1.00 [  0.00]( 1.38)     1.00 [ -0.24]( 3.09)
    128-clients    1.00 [  0.00]( 0.87)     1.00 [ -0.16]( 1.02)
    256-clients    1.00 [  0.00]( 5.36)     1.01 [  0.66]( 4.55)
    512-clients    1.00 [  0.00](54.39)     0.98 [ -1.59](57.35)


    ==================================================================
    Test          : schbench
    Units         : Normalized 99th percentile latency in us
    Interpretation: Lower is better
    Statistic     : Median
    ==================================================================
    #workers:           tip[pct imp](CV)     allow_imb[pct imp](CV)
      1     1.00 [ -0.00]( 8.54)     1.04 [ -4.35]( 3.69)
      2     1.00 [ -0.00]( 1.15)     0.96 [  4.00]( 0.00)
      4     1.00 [ -0.00](13.46)     1.02 [ -2.08]( 2.04)
      8     1.00 [ -0.00]( 7.14)     0.82 [ 17.54]( 9.30)
     16     1.00 [ -0.00]( 3.49)     1.05 [ -5.08]( 7.83)
     32     1.00 [ -0.00]( 1.06)     1.01 [ -1.06]( 5.88)
     64     1.00 [ -0.00]( 5.48)     1.05 [ -4.65]( 2.71)
    128     1.00 [ -0.00](10.45)     1.09 [ -9.11](14.18)
    256     1.00 [ -0.00](31.14)     1.05 [ -5.15]( 9.79)
    512     1.00 [ -0.00]( 1.52)     0.96 [  4.30]( 0.26)


    ==================================================================
    Test          : new-schbench-requests-per-second
    Units         : Normalized Requests per second
    Interpretation: Higher is better
    Statistic     : Median
    ==================================================================
    #workers:           tip[pct imp](CV)     allow_imb[pct imp](CV)
      1     1.00 [  0.00]( 1.07)     1.00 [  0.29]( 0.61)
      2     1.00 [  0.00]( 0.00)     1.00 [  0.00]( 0.26)
      4     1.00 [  0.00]( 0.00)     1.00 [ -0.29]( 0.00)
      8     1.00 [  0.00]( 0.15)     1.00 [  0.29]( 0.15)
     16     1.00 [  0.00]( 0.00)     1.00 [  0.00]( 0.00)
     32     1.00 [  0.00]( 3.41)     0.97 [ -2.86]( 2.91)
     64     1.00 [  0.00]( 1.05)     0.97 [ -3.17]( 7.39)
    128     1.00 [  0.00]( 0.00)     1.00 [ -0.38]( 0.39)
    256     1.00 [  0.00]( 0.72)     1.01 [  0.61]( 0.96)
    512     1.00 [  0.00]( 0.57)     1.01 [  0.72]( 0.21)


    ==================================================================
    Test          : new-schbench-wakeup-latency
    Units         : Normalized 99th percentile latency in us
    Interpretation: Lower is better
    Statistic     : Median
    ==================================================================
    #workers:           tip[pct imp](CV)     allow_imb[pct imp](CV)
      1     1.00 [ -0.00]( 9.11)     0.69 [ 31.25]( 8.13)
      2     1.00 [ -0.00]( 0.00)     0.93 [  7.14]( 8.37)
      4     1.00 [ -0.00]( 3.78)     1.07 [ -7.14](14.79)
      8     1.00 [ -0.00]( 0.00)     1.08 [ -8.33]( 7.56)
     16     1.00 [ -0.00]( 7.56)     1.08 [ -7.69](34.36)
     32     1.00 [ -0.00](15.11)     1.00 [ -0.00](12.99)
     64     1.00 [ -0.00]( 9.63)     0.80 [ 20.00](11.17)
    128     1.00 [ -0.00]( 4.86)     0.98 [  2.01](13.01)
    256     1.00 [ -0.00]( 2.34)     1.01 [ -1.00]( 3.51)
    512     1.00 [ -0.00]( 0.40)     1.00 [  0.38]( 0.20)


    ==================================================================
    Test          : new-schbench-request-latency
    Units         : Normalized 99th percentile latency in us
    Interpretation: Lower is better
    Statistic     : Median
    ==================================================================
    #workers:           tip[pct imp](CV)     allow_imb[pct imp](CV)
      1     1.00 [ -0.00]( 2.73)     0.98 [  2.08]( 3.51)
      2     1.00 [ -0.00]( 0.87)     0.99 [  0.54]( 3.29)
      4     1.00 [ -0.00]( 1.21)     1.06 [ -5.92]( 0.82)
      8     1.00 [ -0.00]( 0.27)     1.03 [ -3.15]( 1.86)
     16     1.00 [ -0.00]( 4.04)     1.00 [ -0.27]( 2.27)
     32     1.00 [ -0.00]( 7.35)     1.30 [-30.45](20.57)
     64     1.00 [ -0.00]( 3.54)     1.01 [ -0.67]( 0.82)
    128     1.00 [ -0.00]( 0.37)     1.00 [  0.21]( 0.18)
    256     1.00 [ -0.00]( 9.57)     0.99 [  1.43]( 7.69)
    512     1.00 [ -0.00]( 1.82)     1.02 [ -2.10]( 0.89)


    ==================================================================
    Test          : Various longer running benchmarks
    Units         : %diff in throughput reported
    Interpretation: Higher is better
    Statistic     : Median
    ==================================================================
    Benchmarks:                  %diff
    ycsb-cassandra               0.07%
    ycsb-mongodb                -0.66%

    deathstarbench-1x            0.36%
    deathstarbench-2x            2.39%
    deathstarbench-3x           -0.09%
    deathstarbench-6x            1.53%

    hammerdb+mysql 16VU         -0.27%
    hammerdb+mysql 64VU         -0.32%

---

I cannot make a hard case for this optimization. You can perhaps
share your iperf numbers if you are seeing significant
improvements there.