Re: [PATCH v5 0/2] sched/fair: Wake short task on current CPU

From: Chen Yu
Date: Mon Feb 20 2023 - 00:59:17 EST


Hi Prateek,
On 2023-02-18 at 01:05:32 +0530, K Prateek Nayak wrote:
> Hello Chenyu and Abel,
>
> I'll leave the detailed results from testing on a dual socket Zen3 system
> (2 x 64C/128T) below.
>
Thanks for the test!
> tl;dr
>
> o Most benchmark results see small wins or are comparable to tip.
> o SpecJBB Max-jOPS sees a small hit but Critical-jOPS improves.
I assume this change is within the acceptable variance range, because the
previous version did not restrict the local wakeup as strictly as the current
version does, and we did not see a hit on Max-jOPS in v4. Anyway, I've
tightened the restriction per Abel's feedback and launched some tests,
hoping that Redis and SpecJBB behave better with it.
> o ycsb-mongodb sees small uplift in NPS1 mode.
> o Numbers for Netperf runs are pending which I'll share in the
> coming week.
> o Abel's suggestion on top of v5 seems promising but there are
> a few regressions I notice on larger workloads.
>
> Detailed Results:
>
> NPS modes are used to logically divide a single socket into
> multiple NUMA regions.
> Following is the NUMA configuration for each NPS mode on the system:
>
> NPS1: Each socket is a NUMA node.
> Total 2 NUMA nodes in the dual socket machine.
>
> Node 0: 0-63, 128-191
> Node 1: 64-127, 192-255
>
> NPS2: Each socket is further logically divided into 2 NUMA regions.
> Total 4 NUMA nodes exist over 2 sockets.
>
> Node 0: 0-31, 128-159
> Node 1: 32-63, 160-191
> Node 2: 64-95, 192-223
> Node 3: 96-127, 224-255
>
> NPS4: Each socket is logically divided into 4 NUMA regions.
> Total 8 NUMA nodes exist over 2 sockets.
>
> Node 0: 0-15, 128-143
> Node 1: 16-31, 144-159
> Node 2: 32-47, 160-175
> Node 3: 48-63, 176-191
> Node 4: 64-79, 192-207
> Node 5: 80-95, 208-223
> Node 6: 96-111, 224-239
> Node 7: 112-127, 240-255
>
> Benchmark Results:
>
> Kernel versions:
> - tip: 6.2.0-rc6 tip sched/core
> - sis_short: 6.2.0-rc6 tip sched/core + this series
>
> When the testing started, the tip was at:
> commit 4d627628d758 "cpuidle: Fix poll_idle() noinstr annotation"
>
> ~~~~~~~~~~~~~
> ~ hackbench ~
> ~~~~~~~~~~~~~
>
> o NPS1
>
> Test: tip sis_short
> 1-groups: 4.38 (0.00 pct) 4.49 (-2.51 pct)
> 2-groups: 5.12 (0.00 pct) 5.20 (-1.56 pct)
> 4-groups: 4.21 (0.00 pct) 4.24 (-0.71 pct)
> 8-groups: 4.68 (0.00 pct) 4.73 (-1.06 pct)
> 16-groups: 6.13 (0.00 pct) 6.35 (-3.58 pct)
>
> o NPS2
>
> Test: tip sis_short
> 1-groups: 4.51 (0.00 pct) 4.36 (3.32 pct)
> 2-groups: 4.31 (0.00 pct) 4.35 (-0.92 pct)
> 4-groups: 4.17 (0.00 pct) 4.08 (2.15 pct)
> 8-groups: 4.58 (0.00 pct) 4.49 (1.96 pct)
> 16-groups: 5.74 (0.00 pct) 5.93 (-3.31 pct)
>
> o NPS4
>
> Test: tip sis_short
> 1-groups: 4.47 (0.00 pct) 4.51 (-0.89 pct)
> 2-groups: 4.97 (0.00 pct) 5.04 (-1.40 pct)
> 4-groups: 4.26 (0.00 pct) 4.28 (-0.46 pct)
> 8-groups: 5.46 (0.00 pct) 5.56 (-1.83 pct)
> 16-groups: 6.38 (0.00 pct) 6.10 (4.38 pct)
>
> ~~~~~~~~~~~~
> ~ schbench ~
> ~~~~~~~~~~~~
>
> o NPS1
>
> #workers: tip sis_short
> 1: 36.00 (0.00 pct) 27.00 (25.00 pct)
> 2: 37.00 (0.00 pct) 32.00 (13.51 pct)
> 4: 41.00 (0.00 pct) 34.00 (17.07 pct)
> 8: 46.00 (0.00 pct) 43.00 (6.52 pct)
> 16: 66.00 (0.00 pct) 66.00 (0.00 pct)
> 32: 111.00 (0.00 pct) 108.00 (2.70 pct)
> 64: 207.00 (0.00 pct) 206.00 (0.48 pct)
> 128: 483.00 (0.00 pct) 481.00 (0.41 pct)
> 256: 46272.00 (0.00 pct) 45120.00 (2.48 pct)
> 512: 76160.00 (0.00 pct) 77696.00 (-2.01 pct)
>
> o NPS2
>
> #workers: tip sis_short
> 1: 33.00 (0.00 pct) 31.00 (6.06 pct)
> 2: 35.00 (0.00 pct) 31.00 (11.42 pct)
> 4: 38.00 (0.00 pct) 38.00 (0.00 pct)
> 8: 51.00 (0.00 pct) 47.00 (7.84 pct)
> 16: 64.00 (0.00 pct) 67.00 (-4.68 pct)
> 32: 118.00 (0.00 pct) 116.00 (1.69 pct)
> 64: 214.00 (0.00 pct) 217.00 (-1.40 pct)
> 128: 497.00 (0.00 pct) 504.00 (-1.40 pct)
> 256: 45632.00 (0.00 pct) 44352.00 (2.80 pct)
> 512: 81024.00 (0.00 pct) 78464.00 (3.15 pct)
>
> o NPS4
>
> #workers: tip sis_short
> 1: 33.00 (0.00 pct) 32.00 (3.03 pct)
> 2: 40.00 (0.00 pct) 32.00 (20.00 pct)
> 4: 42.00 (0.00 pct) 38.00 (9.52 pct)
> 8: 64.00 (0.00 pct) 65.00 (-1.56 pct)
> 16: 73.00 (0.00 pct) 69.00 (5.47 pct)
> 32: 112.00 (0.00 pct) 112.00 (0.00 pct)
> 64: 215.00 (0.00 pct) 207.00 (3.72 pct)
> 128: 615.00 (0.00 pct) 593.00 (3.73 pct)
> 256: 46144.00 (0.00 pct) 45376.00 (1.66 pct)
> 512: 78208.00 (0.00 pct) 77696.00 (0.65 pct)
>
>
> ~~~~~~~~~~
> ~ tbench ~
> ~~~~~~~~~~
>
> o NPS1
>
> Clients: tip sis_short
> 1 536.78 (0.00 pct) 537.38 (0.11 pct)
> 2 1050.74 (0.00 pct) 1058.74 (0.76 pct)
> 4 1993.47 (0.00 pct) 1976.79 (-0.83 pct)
> 8 3498.02 (0.00 pct) 3657.16 (4.54 pct)
> 16 6202.01 (0.00 pct) 6014.62 (-3.02 pct)
> 32 11544.55 (0.00 pct) 11847.47 (2.62 pct)
> 64 21828.75 (0.00 pct) 21754.85 (-0.33 pct)
> 128 31095.92 (0.00 pct) 31643.35 (1.76 pct)
> 256 54828.12 (0.00 pct) 55432.29 (1.10 pct)
> 512 54888.10 (0.00 pct) 55917.91 (1.87 pct)
> 1024 54916.75 (0.00 pct) 53468.79 (-2.63 pct)
>
> o NPS2
>
> Clients: tip sis_short
> 1 543.08 (0.00 pct) 544.49 (0.25 pct)
> 2 1074.55 (0.00 pct) 1060.33 (-1.32 pct)
> 4 1980.75 (0.00 pct) 1992.86 (0.61 pct)
> 8 3628.36 (0.00 pct) 3507.73 (-3.32 pct)
> 16 5806.00 (0.00 pct) 5790.82 (-0.26 pct)
> 32 11351.94 (0.00 pct) 10937.21 (-3.26 pct)
> 64 19987.40 (0.00 pct) 20739.38 (3.76 pct)
> 128 29554.40 (0.00 pct) 30011.99 (1.54 pct)
> 256 53594.11 (0.00 pct) 51473.78 (-3.95 pct)
> 512 54304.03 (0.00 pct) 52998.31 (-2.40 pct)
> 1024 54338.25 (0.00 pct) 53265.51 (-1.97 pct)
>
> o NPS4
>
> Clients: tip sis_short
> 1 541.29 (0.00 pct) 536.21 (-0.93 pct)
> 2 1045.15 (0.00 pct) 1054.94 (0.93 pct)
> 4 1973.01 (0.00 pct) 1988.63 (0.79 pct)
> 8 3490.55 (0.00 pct) 3535.27 (1.28 pct)
> 16 5920.12 (0.00 pct) 5846.04 (-1.25 pct)
> 32 10933.38 (0.00 pct) 10944.33 (0.10 pct)
> 64 19628.34 (0.00 pct) 19328.66 (1.01 pct)
> 128 29785.23 (0.00 pct) 28749.48 (-4.55 pct)
> 256 51999.72 (0.00 pct) 51336.20 (-1.27 pct)
> 512 53619.42 (0.00 pct) 53269.04 (-0.65 pct)
> 1024 53956.57 (0.00 pct) 53666.14 (-0.53 pct)
>
>
> ~~~~~~~~~~
> ~ stream ~
> ~~~~~~~~~~
>
> o NPS1
>
> 10 Runs:
>
> Test: tip sis_short
> Copy: 320576.16 (0.00 pct) 328194.56 (2.37 pct)
> Scale: 212869.80 (0.00 pct) 216713.96 (1.80 pct)
> Add: 241556.74 (0.00 pct) 247467.26 (2.44 pct)
> Triad: 250637.58 (0.00 pct) 245538.49 (-2.03 pct)
>
> 100 Runs:
>
> Test: tip sis_short
> Copy: 330058.38 (0.00 pct) 329339.60 (-0.21 pct)
> Scale: 216475.85 (0.00 pct) 219334.10 (1.32 pct)
> Add: 243028.82 (0.00 pct) 244037.77 (0.41 pct)
> Triad: 252907.98 (0.00 pct) 257210.37 (1.70 pct)
>
> o NPS2
>
> 10 Runs:
>
> Test: tip sis_short
> Copy: 339946.34 (0.00 pct) 327261.79 (-3.73 pct)
> Scale: 217453.46 (0.00 pct) 221366.66 (1.79 pct)
> Add: 258099.63 (0.00 pct) 258472.44 (0.14 pct)
> Triad: 264974.76 (0.00 pct) 262618.99 (-0.88 pct)
>
> 100 Runs:
>
> Test: tip sis_short
> Copy: 335725.30 (0.00 pct) 320797.67 (-4.44 pct)
> Scale: 229985.45 (0.00 pct) 221706.62 (-3.59 pct)
> Add: 260546.33 (0.00 pct) 250668.80 (-3.79 pct)
> Triad: 267925.27 (0.00 pct) 262959.86 (-1.85 pct)
>
> o NPS4
>
> 10 Runs:
>
> Test: tip sis_short
> Copy: 369037.34 (0.00 pct) 371514.46 (0.67 pct)
> Scale: 238235.39 (0.00 pct) 237661.29 (-0.24 pct)
> Add: 263626.48 (0.00 pct) 263436.20 (-0.07 pct)
> Triad: 280881.43 (0.00 pct) 288059.52 (2.55 pct)
>
> 100 Runs:
>
> Test: tip sis_short
> Copy: 339036.66 (0.00 pct) 346904.09 (2.32 pct)
> Scale: 246638.02 (0.00 pct) 230195.65 (-6.66 pct)
> Add: 259898.86 (0.00 pct) 244631.77 (-5.87 pct)
> Triad: 265719.02 (0.00 pct) 264620.50 (-0.41 pct)
>
> ~~~~~~~~~~~~~~~~
> ~ ycsb-mongodb ~
> ~~~~~~~~~~~~~~~~
>
> o NPS1:
>
> tip : 133514.00 (var: 2.07%)
> sis-short : 137664.67 (var: 1.45%) (3.11%)
>
> o NPS2:
>
> tip : 132193.33 (var: 1.46%)
> sis-short : 131189.33 (var: 1.69%) (-0.75%)
>
> o NPS4:
>
> tip : 133285.67 (var: 1.77%)
> sis-short : 133891.33 (var: 1.58%) (0.45%)
>
> ~~~~~~~~~~~~~
> ~ unixbench ~
> ~~~~~~~~~~~~~
>
> o NPS1
>
> Test Metric Parallelism tip sis_short
> unixbench-dhry2reg Hmean unixbench-dhry2reg-1 48665321.00 ( 0.00%) 48553432.30 ( -0.23%)
> unixbench-dhry2reg Hmean unixbench-dhry2reg-512 6281376826.80 ( 0.00%) 6277335150.50 ( -0.06%)
> unixbench-syscall Amean unixbench-syscall-1 2689026.67 ( 0.00%) 2682044.73 * 0.26%*
> unixbench-syscall Amean unixbench-syscall-512 7352453.23 ( 0.00%) 7290524.47 * -0.84%*
> unixbench-pipe Hmean unixbench-pipe-1 2467955.46 ( 0.00%) 2426076.17 * -1.70%*
> unixbench-pipe Hmean unixbench-pipe-512 295937232.39 ( 0.00%) 293462420.03 * -0.84%*
> unixbench-spawn Hmean unixbench-spawn-1 4164.75 ( 0.00%) 4229.59 ( 1.56%)
> unixbench-spawn Hmean unixbench-spawn-512 79950.80 ( 0.00%) 76439.30 ( -4.39%)
> unixbench-execl Hmean unixbench-execl-1 4112.25 ( 0.00%) 4151.37 ( 0.95%)
> unixbench-execl Hmean unixbench-execl-512 11785.88 ( 0.00%) 11756.46 ( -0.25%)
>
> o NPS2
>
> Test Metric Parallelism tip sis_short
> unixbench-dhry2reg Hmean unixbench-dhry2reg-1 49671827.09 ( 0.00%) 49077076.00 ( -1.20%)
> unixbench-dhry2reg Hmean unixbench-dhry2reg-512 6282239821.90 ( 0.00%) 6283671307.30 ( 0.02%)
> unixbench-syscall Amean unixbench-syscall-1 2688504.20 ( 0.00%) 2676278.60 * 0.45%*
> unixbench-syscall Amean unixbench-syscall-512 7321621.07 ( 0.00%) 7784926.60 * 6.33%*
> unixbench-pipe Hmean unixbench-pipe-1 2469941.97 ( 0.00%) 2419584.09 * -2.04%*
> unixbench-pipe Hmean unixbench-pipe-512 296146392.10 ( 0.00%) 293156913.86 * -1.01%*
> unixbench-spawn Hmean unixbench-spawn-1 5029.05 ( 0.00%) 5015.18 ( -0.28%)
> unixbench-spawn Hmean unixbench-spawn-512 77198.79 ( 0.00%) 80409.23 * 4.16%*
> unixbench-execl Hmean unixbench-execl-1 4092.59 ( 0.00%) 4158.36 * 1.61%*
> unixbench-execl Hmean unixbench-execl-512 12293.67 ( 0.00%) 12169.31 ( -1.01%)
>
> o NPS4
>
> Test Metric Parallelism tip sis_short
> unixbench-dhry2reg Hmean unixbench-dhry2reg-1 48944542.05 ( 0.00%) 49490899.03 * 1.12%*
> unixbench-dhry2reg Hmean unixbench-dhry2reg-512 6291259625.50 ( 0.00%) 6299305899.90 ( 0.13%)
> unixbench-syscall Amean unixbench-syscall-1 2686991.73 ( 0.00%) 2682940.53 * 0.15%*
> unixbench-syscall Amean unixbench-syscall-512 7902201.47 ( 0.00%) 7931906.47 ( -0.38%)
> unixbench-pipe Hmean unixbench-pipe-1 2468813.43 ( 0.00%) 2422272.88 * -1.89%*
> unixbench-pipe Hmean unixbench-pipe-512 297109244.52 ( 0.00%) 294589928.27 * -0.85%*
> unixbench-spawn Hmean unixbench-spawn-1 5161.67 ( 0.00%) 5012.58 ( -2.89%)
> unixbench-spawn Hmean unixbench-spawn-512 78657.60 ( 0.00%) 78572.80 ( -0.11%)
> unixbench-execl Hmean unixbench-execl-1 4112.02 ( 0.00%) 4122.16 ( 0.25%)
> unixbench-execl Hmean unixbench-execl-512 13700.99 ( 0.00%) 14173.20 * 3.44%*
>
> ~~~~~~~~~~~
> ~ SpecJBB ~
> ~~~~~~~~~~~
>
> o NPS1 - Normalized to baseline (tip)
>
> Kernel tip sis_short
> Max-jOPS 100% 98.53%
> Critical-jOPS 100% 105.61%
>
> ~~~~~~~~~~~~~~~~~~
> ~ DeathStarBench ~
> ~~~~~~~~~~~~~~~~~~
>
> o NPS1 - Normalized to baseline (tip)
>
> Kernel : tip sis_short
> 8C/16T : 100.00% 100.54%
> 16C/32T : 100.00% 100.19%
> 32C/64T : 100.00% 98.08%
> 64C/128T : 100.00% 98.34%
>
>
> --------------- With Abel's suggestion added to v5 ---------------
>
> I've added the hunk suggested by Abel in the thread to v5, and the
> following are the results for the same set of benchmarks, but only for
> the machine running in NPS1 mode.
>
> sis_short_v5.1: 6.2.0-rc6 tip sched/core + this series + Abel's suggestion
>
> ~~~~~~~~~~~~~
> ~ hackbench ~
> ~~~~~~~~~~~~~
>
> o NPS1
>
> Test: tip sis_short_v5.1
> 1-groups: 4.38 (0.00 pct) 4.08 (6.84 pct)
> 2-groups: 5.12 (0.00 pct) 5.10 (0.39 pct)
> 4-groups: 4.21 (0.00 pct) 4.23 (-0.47 pct)
> 8-groups: 4.68 (0.00 pct) 4.69 (-0.21 pct)
> 16-groups: 6.13 (0.00 pct) 5.94 (3.09 pct)
>
> ~~~~~~~~~~~~
> ~ schbench ~
> ~~~~~~~~~~~~
>
> o NPS1
>
> #workers: tip sis_short_v5.1
> 1: 36.00 (0.00 pct) 36.00 (0.00 pct)
> 2: 37.00 (0.00 pct) 39.00 (-5.40 pct)
> 4: 41.00 (0.00 pct) 40.00 (2.43 pct)
> 8: 46.00 (0.00 pct) 46.00 (0.00 pct)
> 16: 66.00 (0.00 pct) 68.00 (-3.03 pct)
> 32: 111.00 (0.00 pct) 112.00 (-0.90 pct)
> 64: 207.00 (0.00 pct) 238.00 (-14.97 pct)
> 64: 227.00 (0.00 pct) 219.00 (3.52 pct)
> 128: 483.00 (0.00 pct) 494.00 (-2.27 pct)
> 256: 46272.00 (0.00 pct) 41280.00 (10.78 pct)
> 512: 78293.00 (0.00 pct) 79325.00 (-1.31 pct)
>
> ~~~~~~~~~~
> ~ tbench ~
> ~~~~~~~~~~
>
> o NPS1
>
> Clients: tip sis_short_v5.1
> 1 536.78 (0.00 pct) 535.90 (-0.16 pct)
> 2 1050.74 (0.00 pct) 1067.32 (1.57 pct)
> 4 1993.47 (0.00 pct) 1971.63 (-1.09 pct)
> 8 3601.77 (0.00 pct) 3599.17 (-0.07 pct)
> 16 6202.01 (0.00 pct) 6115.08 (-1.40 pct)
> 32 11544.55 (0.00 pct) 11423.52 (-1.04 pct)
> 64 21828.75 (0.00 pct) 21403.94 (-1.94 pct)
> 128 31095.92 (0.00 pct) 30783.55 (-1.00 pct)
> 256 54828.12 (0.00 pct) 55328.94 (0.91 pct)
> 512 54888.10 (0.00 pct) 53483.33 (-2.55 pct)
> 1024 48407.14 (0.00 pct) 48998.95 (1.22 pct)
>
> ~~~~~~~~~~
> ~ stream ~
> ~~~~~~~~~~
>
> o NPS1
>
> 10 Runs:
>
> Test: tip sis_short_v5.1
> Copy: 320576.16 (0.00 pct) 331810.14 (3.50 pct)
> Scale: 212869.80 (0.00 pct) 214725.82 (0.87 pct)
> Add: 241556.74 (0.00 pct) 242340.92 (0.32 pct)
> Triad: 250637.58 (0.00 pct) 251271.53 (0.25 pct)
>
> 100 Runs:
>
> Test: tip sis_short_v5.1
> Copy: 330058.38 (0.00 pct) 331966.60 (0.57 pct)
> Scale: 216475.85 (0.00 pct) 222777.84 (2.91 pct)
> Add: 243028.82 (0.00 pct) 250873.78 (3.22 pct)
> Triad: 252907.98 (0.00 pct) 253791.20 (0.34 pct)
>
> ~~~~~~~~~~~~~~~~
> ~ ycsb-mongodb ~
> ~~~~~~~~~~~~~~~~
>
> o NPS1:
>
> tip : 133514.00 (var: 2.07%)
> sis-short_v5.1 : 129172.67 (var: 2.32%) (-3.25%) **
>
> ~~~~~~~~~~~~~
> ~ unixbench ~
> ~~~~~~~~~~~~~
>
> o NPS1
>
> Test Metric Parallelism tip sis_short_v5.1
> unixbench-dhry2reg Hmean unixbench-dhry2reg-1 49266026.90 ( 0.00%) 49054799.90 ( -0.43%)
> unixbench-dhry2reg Hmean unixbench-dhry2reg-512 6285063007.68 ( 0.00%) 6280424934.15 ( -0.07%)
> unixbench-syscall Amean unixbench-syscall-1 2689026.67 ( 0.00%) 2677968.03 * 0.41%*
> unixbench-syscall Amean unixbench-syscall-512 7352453.23 ( 0.00%) 7354325.40 ( -0.03%)
> unixbench-pipe Hmean unixbench-pipe-1 2467955.46 ( 0.00%) 2351117.60 * -4.73%*
> unixbench-pipe Hmean unixbench-pipe-512 295937232.39 ( 0.00%) 295769918.99 ( -0.06%)
> unixbench-spawn Hmean unixbench-spawn-1 4164.75 ( 0.00%) 4331.89 * 4.01%*
> unixbench-spawn Hmean unixbench-spawn-512 79626.61 ( 0.00%) 77865.32 * -2.21%*
> unixbench-execl Hmean unixbench-execl-1 4112.25 ( 0.00%) 4145.85 ( 0.82%)
> unixbench-execl Hmean unixbench-execl-512 11785.88 ( 0.00%) 11935.41 ( 1.27%)
>
> ~~~~~~~~~~~
> ~ SpecJBB ~
> ~~~~~~~~~~~
>
> o NPS1 - Normalized to baseline (tip)
>
> Kernel tip sis_short_V5.1
> Max-jOPS 100% 91.99% ** (-8.01%)
> Critical-jOPS 100% 99.29%
>
> ~~~~~~~~~~~~~~~~~~
> ~ DeathStarBench ~
> ~~~~~~~~~~~~~~~~~~
>
> o NPS1 - Throughput normalized to baseline (tip)
>
> Kernel : tip sis_short_V5.1
> 8C/16T : 100.00% 93.75% ** (-6.25%)
> 16C/32T : 100.00% 100.43%
> 32C/64T : 100.00% 101.12%
> 64C/128T : 100.00% 100.21%
>
> o Follow wake_affine_bias() if waker's cpu and prev_cpu are on same LLC?
>
> There are cases with Abel's suggestion where some of the larger
> benchmarks regress. I wonder if wake_affine_bias() can still be
> considered for short running tasks if the waker's CPU and the
> prev_cpu share caches. In DeathStarBench 8C/16T case, the
> services are all pinned to the CPUs of same MC domain. The
> regression observed seems to arise from the missed opportunity
> to distribute load among the CPUs sharing the same L3. I do not
> have data for this currently but I'll update the thread with any
> findings.
Good observation. Just as in select_idle_sibling(), the prev CPU should only
be chosen if the target CPU and the prev CPU share the LLC. My next
version still prefers the current CPU over the prev CPU, and adds a wakee_flips
check so that tasks are aggregated on the current CPU only when the waker and
wakee wake each other up frequently. This should also mitigate, to some extent,
the problem Abel mentioned, where too many short tasks are stacked on the
current CPU.

thanks,
Chenyu
>
> I'll also queue up a Redis run from mmtest to see if I can reproduce
> Abel's observations on my system however I'm not sure if the
> utilization will be high enough to emulate the same scenario as
> Abel's prod environment. If the migrations within the same MC
>
> On 2/3/2023 10:47 AM, Chen Yu wrote:
> > The main purpose is to avoid cross-CPU wakeups when they are
> > unnecessary. Frequent cross-CPU wakeups do significant damage
> > to some workloads, especially on high core count systems.
> >
> > Inhibit the cross-CPU wakeup by placing the wakee on the waking CPU
> > if both the waker and wakee are short-duration tasks. A short-duration
> > task can become a troublemaker on a heavily loaded system because it
> > causes frequent context switches, so this strategy only takes effect
> > when the system is busy. Besides, it is unreasonable to inhibit the
> > idle CPU scan when there are still idle CPUs.
> >
> > First, introduce the definition of a short-duration task. Then
> > leverage the first patch to choose a local CPU for the wakee.
> >
> > Overall there is a significant performance improvement on an Intel
> > 2 x 56C/112T platform, such as will-it-scale (1200+%) and
> > netperf (600+%) in some cases, and no noticeable impact on
> > schbench, hackbench, tbench and an OLTP workload with a commercial RDBMS.
> >
> > I'm looking for test results on other platforms, such as Zen3 and Kunpeng
> > Arm64. Prateek and Yicong, I would appreciate it if you could give this
> > version a try.
> >
> > Changes since v4:
> > 1. Dietmar has commented on the task duration calculation. So refined
> > the commit log to reduce confusion.
> > 2. Change [PATCH 1/2] to only record the average duration of a task.
> > So this change could benefit UTIL_EST_FASTER[1].
> > 3. As v4 reported regression on Zen3 and Kunpeng Arm64, add back
> > the system average utilization restriction: if the system
> > is not busy, do not enable the short wakeup. This logic has
> > shown improvement on Zen3[2].
> > 4. Restrict the wakeup target to be current CPU, rather than both
> > current CPU and task's previous CPU. This could also benefit
> > wakeup optimization from interrupt in the future, which is
> > suggested by Yicong.
> >
> > Changes since v3:
> > 1. Honglei and Josh have concern that the threshold of short
> > task duration could be too long. Decreased the threshold from
> > sysctl_sched_min_granularity to (sysctl_sched_min_granularity / 8),
> > and the '8' comes from get_update_sysctl_factor().
> > 2. Export p->se.dur_avg to /proc/{pid}/sched per Yicong's suggestion.
> > 3. Move the calculation of average duration from put_prev_task_fair()
> > to dequeue_task_fair(), because in v3 put_prev_task_fair() is not
> > invoked by pick_next_task_fair() in the fast path, so dur_avg
> > could not be updated in time.
> > 4. Fix the comment in PATCH 2/2, that "WRITE_ONCE(CPU1->ttwu_pending, 1);"
> > on CPU0 is earlier than CPU1 getting "ttwu_list->p0", per Tianchen.
> > 5. Move the scan for CPU with short duration task from select_idle_cpu()
> > to select_idle_sibling(), because there is no CPU scan involved, per
> > Yicong.
> >
> > Changes since v2:
> >
> > 1. Peter suggested comparing the duration of waker and the cost to
> > scan for an idle CPU: If the cost is higher than the task duration,
> > do not waste time finding an idle CPU, choose the local or previous
> > CPU directly. A prototype was created based on this suggestion.
> > However, according to the test result, this prototype does not inhibit
> > the cross CPU wakeup and did not bring improvement. Because the cost
> > to find an idle CPU is small in the problematic scenario. The root
> > cause of the problem is a race condition between scanning for an idle
> > CPU and task enqueue(please refer to the commit log in PATCH 2/2).
> > So v3 does not change the core logic of v2, with some refinement based
> > on Peter's suggestion.
> >
> > 2. Simplify the logic to record the task duration per Peter and Abel's suggestion.
> >
> >
> > [1] https://lore.kernel.org/lkml/c56855a7-14fd-4737-fc8b-8ea21487c5f6@xxxxxxx/
> > [2] https://lore.kernel.org/all/cover.1666531576.git.yu.c.chen@xxxxxxxxx/
> >
> > v4: https://lore.kernel.org/lkml/cover.1671158588.git.yu.c.chen@xxxxxxxxx/
> > v3: https://lore.kernel.org/lkml/cover.1669862147.git.yu.c.chen@xxxxxxxxx/
> > v2: https://lore.kernel.org/all/cover.1666531576.git.yu.c.chen@xxxxxxxxx/
> > v1: https://lore.kernel.org/lkml/20220915165407.1776363-1-yu.c.chen@xxxxxxxxx/
> >
> > Chen Yu (2):
> > sched/fair: Record the average duration of a task
> > sched/fair: Introduce SIS_SHORT to wake up short task on current CPU
> >
> > include/linux/sched.h | 3 +++
> > kernel/sched/core.c | 2 ++
> > kernel/sched/debug.c | 1 +
> > kernel/sched/fair.c | 39 +++++++++++++++++++++++++++++++++++++++
> > kernel/sched/features.h | 1 +
> > 5 files changed, 46 insertions(+)
> >
>
> The netperf results are still pending and I'll update the thread
> with the same in the coming week. If you would like me to test
> or gather some data for specific workload on the test system,
> please do let me know.
> --
I'll launch more tests and send out the results later.

thanks,
Chenyu