Re: [RFC PATCH v4 0/2] sched/fair: Choose the CPU where short task is running during wake up

From: K Prateek Nayak
Date: Mon Jan 16 2023 - 05:54:10 EST


Hello Chenyu,

On 12/30/2022 8:17 AM, Chen Yu wrote:
> On 2022-12-29 at 12:46:59 +0530, K Prateek Nayak wrote:
>> Hello Chenyu,
>>
>> Including the detailed results from testing below.
>>
>> tl;dr
>>
>> o There seems to be 3 noticeable regressions:
>> - tbench for lower number of clients. The schedstat data shows
>> an increase in wait time.
>> - SpecJBB MultiJVM performance drops as the workload prefers
>> an idle CPU over a busy one.
>> - Unixbench-pipe benchmark performance drops.
>>
>> o Most benchmark numbers remain same.
>>
>> o Small gains seen for ycsb-mongodb and unixbench-syscall.
>>

Please ignore the last test results. The tests did not use
exactly the same config for the tip and sis_short kernels, which
led to more overhead in the network stack for the sis_short
kernel. The longer wait time seen in the schedstat data for
tbench was a result of each loop taking longer to finish.

I reran the benchmarks on the latest tip, making sure the
configs are identical this time, and noticed only one
regression, in SpecJBB Critical-jOPS.

tl;dr

o tbench sees a good improvement in throughput when
the machine is fully loaded and beyond.
o Some unixbench test cases show improvement, as does
ycsb-mongodb in NPS2 and NPS4 modes.
o Most benchmark results are the same.
o SpecJBB Critical-jOPS is still down. I'll share the full
schedstat dump for the tasks with you separately.

Following are the detailed results from a dual-socket
Zen3 machine.

The kernel versions are:

tip: 6.2.0-rc2 tip:sched/core at
commit: bbd0b031509b "sched/rseq: Fix concurrency ID handling of usermodehelper kthreads"
sis_short: tip + series
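
A note for anyone recomputing the pct columns in the results: they
appear to be the relative change versus tip, signed so that a positive
value means sis_short did better, and truncated (not rounded) to two
decimals. That convention is inferred from the numbers themselves and
is not stated anywhere in this mail, and the function name and the
higher_is_better flag are made up for illustration. The unixbench
tables come in a different format and may round instead. A minimal
sketch under those assumptions:

```python
from decimal import Decimal, ROUND_DOWN


def pct_change(tip, sis_short, higher_is_better):
    """Relative change vs. tip, positive when sis_short is better.

    Assumption: results are truncated toward zero at two decimals,
    which matches the hackbench/schbench/tbench rows checked by hand.
    """
    base = Decimal(tip)
    new = Decimal(sis_short)
    # For time/latency (lower is better) an increase is a regression,
    # so the sign of the difference is flipped.
    delta = (new - base) if higher_is_better else (base - new)
    pct = delta / base * 100
    return pct.quantize(Decimal("0.01"), rounding=ROUND_DOWN)


# hackbench NPS1, 1-groups (lower is better)
print(pct_change("4.36", "4.33", higher_is_better=False))
# tbench NPS1, 512 clients (higher is better)
print(pct_change("46797.40", "55090.07", higher_is_better=True))
```

Values are passed as strings so that Decimal sees exactly the digits
printed in the tables rather than a binary float approximation.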

~~~~~~~~~~~~~
~ hackbench ~
~~~~~~~~~~~~~

o NPS1

Test:       tip                 sis_short
1-groups:   4.36 (0.00 pct)     4.33 (0.68 pct)
2-groups:   5.17 (0.00 pct)     5.20 (-0.58 pct)
4-groups:   4.17 (0.00 pct)     4.19 (-0.47 pct)
8-groups:   4.64 (0.00 pct)     4.71 (-1.50 pct)
16-groups:  5.43 (0.00 pct)     6.60 (-21.54 pct)   * [Machine is overloaded with avg 2.5 tasks per run queue]
16-groups:  5.90 (0.00 pct)     5.87 (0.50 pct)     [Verification Run]

o NPS2

Test:       tip                 sis_short
1-groups:   4.43 (0.00 pct)     4.18 (5.64 pct)
2-groups:   4.61 (0.00 pct)     4.99 (-8.24 pct)    *
2-groups:   4.72 (0.00 pct)     4.74 (-0.42 pct)    [Verification Run]
4-groups:   4.25 (0.00 pct)     4.20 (1.17 pct)
8-groups:   4.91 (0.00 pct)     5.08 (-3.46 pct)
16-groups:  5.84 (0.00 pct)     5.84 (0.00 pct)

o NPS4

Test:       tip                 sis_short
1-groups:   4.34 (0.00 pct)     4.39 (-1.15 pct)
2-groups:   4.64 (0.00 pct)     4.97 (-7.11 pct)    *
2-groups:   4.86 (0.00 pct)     4.77 (1.85 pct)     [Verification Run]
4-groups:   4.20 (0.00 pct)     4.26 (-1.42 pct)
8-groups:   5.21 (0.00 pct)     5.19 (0.38 pct)
16-groups:  6.24 (0.00 pct)     5.93 (4.96 pct)

~~~~~~~~~~~~
~ schbench ~
~~~~~~~~~~~~

$ schbench -m 2 -t <workers> -s 30

o NPS1

#workers:   tip                     sis_short
1:          36.00 (0.00 pct)        26.00 (27.77 pct)
2:          37.00 (0.00 pct)        36.00 (2.70 pct)
4:          37.00 (0.00 pct)        37.00 (0.00 pct)
8:          47.00 (0.00 pct)        49.00 (-4.25 pct)
16:         64.00 (0.00 pct)        69.00 (-7.81 pct)
32:         109.00 (0.00 pct)       118.00 (-8.25 pct)      *
32:         117.00 (0.00 pct)       116.00 (0.86 pct)       [Verification Run]
64:         222.00 (0.00 pct)       219.00 (1.35 pct)
128:        515.00 (0.00 pct)       513.00 (0.38 pct)
256:        39744.00 (0.00 pct)     35776.00 (9.98 pct)
512:        81280.00 (0.00 pct)     76672.00 (5.66 pct)

o NPS2

#workers:   tip                     sis_short
1:          27.00 (0.00 pct)        25.00 (7.40 pct)
2:          31.00 (0.00 pct)        31.00 (0.00 pct)
4:          38.00 (0.00 pct)        37.00 (2.63 pct)
8:          50.00 (0.00 pct)        55.00 (-10.00 pct)
16:         66.00 (0.00 pct)        66.00 (0.00 pct)
32:         116.00 (0.00 pct)       119.00 (-2.58 pct)
64:         210.00 (0.00 pct)       219.00 (-4.28 pct)
128:        523.00 (0.00 pct)       497.00 (4.97 pct)
256:        44864.00 (0.00 pct)     46656.00 (-3.99 pct)
512:        78464.00 (0.00 pct)     79488.00 (-1.30 pct)

o NPS4

#workers:   tip                     sis_short
1:          32.00 (0.00 pct)        34.00 (-6.25 pct)
2:          32.00 (0.00 pct)        35.00 (-9.37 pct)
4:          34.00 (0.00 pct)        39.00 (-14.70 pct)
8:          58.00 (0.00 pct)        51.00 (12.06 pct)
16:         67.00 (0.00 pct)        70.00 (-4.47 pct)
32:         118.00 (0.00 pct)       123.00 (-4.23 pct)
64:         224.00 (0.00 pct)       227.00 (-1.33 pct)
128:        533.00 (0.00 pct)       537.00 (-0.75 pct)
256:        43456.00 (0.00 pct)     48192.00 (-10.89 pct)   * [Machine overloaded - avg 2 tasks per run queue]
256:        46911.59 (0.00 pct)     47163.28 (-0.53 pct)    [Verification Run]
512:        78976.00 (0.00 pct)     78976.00 (0.00 pct)


~~~~~~~~~~
~ tbench ~
~~~~~~~~~~

o NPS1

Clients:  tip                     sis_short
1         539.96 (0.00 pct)       532.63 (-1.35 pct)
2         1068.21 (0.00 pct)      1057.35 (-1.01 pct)
4         1994.76 (0.00 pct)      2015.05 (1.01 pct)
8         3602.30 (0.00 pct)      3598.70 (-0.09 pct)
16        6075.49 (0.00 pct)      6019.96 (-0.91 pct)
32        11641.07 (0.00 pct)     11774.03 (1.14 pct)
64        21529.16 (0.00 pct)     21392.97 (-0.63 pct)
128       30852.92 (0.00 pct)     31355.87 (1.63 pct)
256       51901.20 (0.00 pct)     54896.08 (5.77 pct)
512       46797.40 (0.00 pct)     55090.07 (17.72 pct)
1024      46057.28 (0.00 pct)     54374.89 (18.05 pct)

o NPS2

Clients:  tip                     sis_short
1         536.11 (0.00 pct)       542.14 (1.12 pct)
2         1044.58 (0.00 pct)      1057.98 (1.28 pct)
4         2043.92 (0.00 pct)      1981.64 (-3.04 pct)
8         3572.50 (0.00 pct)      3579.03 (0.18 pct)
16        6040.97 (0.00 pct)      5946.20 (-1.56 pct)
32        10794.10 (0.00 pct)     11348.54 (5.13 pct)
64        20905.89 (0.00 pct)     21340.31 (2.07 pct)
128       30885.39 (0.00 pct)     30834.59 (-0.16 pct)
256       48901.25 (0.00 pct)     51905.17 (6.14 pct)
512       49673.91 (0.00 pct)     53608.18 (7.92 pct)
1024      47626.34 (0.00 pct)     53396.88 (12.11 pct)

o NPS4

Clients:  tip                     sis_short
1         544.91 (0.00 pct)       542.78 (-0.39 pct)
2         1046.49 (0.00 pct)      1057.16 (1.01 pct)
4         2007.11 (0.00 pct)      2001.21 (-0.29 pct)
8         3590.66 (0.00 pct)      3427.33 (-4.54 pct)
16        5956.60 (0.00 pct)      5898.69 (-0.97 pct)
32        10431.73 (0.00 pct)     10732.48 (2.88 pct)
64        21563.37 (0.00 pct)     19141.76 (-11.23 pct)   *
64        21140.75 (0.00 pct)     20883.78 (-1.21 pct)    [Verification Run]
128       30352.16 (0.00 pct)     28757.44 (-5.25 pct)    *
128       29537.66 (0.00 pct)     29488.04 (-0.16 pct)    [Verification Run]
256       49504.51 (0.00 pct)     52492.40 (6.03 pct)
512       44916.61 (0.00 pct)     52746.38 (17.43 pct)
1024      49986.21 (0.00 pct)     53169.62 (6.36 pct)


~~~~~~~~~~
~ stream ~
~~~~~~~~~~

o NPS1

- 10 Runs:

Test:    tip                       sis_short
Copy:    336796.79 (0.00 pct)      336858.84 (0.01 pct)
Scale:   212768.55 (0.00 pct)      213061.98 (0.13 pct)
Add:     244000.34 (0.00 pct)      237874.08 (-2.51 pct)
Triad:   255042.52 (0.00 pct)      250122.34 (-1.92 pct)

- 100 Runs:

Test:    tip                       sis_short
Copy:    335938.02 (0.00 pct)      324841.05 (-3.30 pct)
Scale:   212597.92 (0.00 pct)      211516.93 (-0.50 pct)
Add:     248294.62 (0.00 pct)      241706.28 (-2.65 pct)
Triad:   258400.88 (0.00 pct)      251427.43 (-2.69 pct)

o NPS2

- 10 Runs:

Test:    tip                       sis_short
Copy:    340709.53 (0.00 pct)      338797.30 (-0.56 pct)
Scale:   216849.08 (0.00 pct)      218167.05 (0.60 pct)
Add:     257761.46 (0.00 pct)      258717.38 (0.37 pct)
Triad:   268615.11 (0.00 pct)      270284.11 (0.62 pct)

- 100 Runs:

Test:    tip                       sis_short
Copy:    326385.13 (0.00 pct)      314299.63 (-3.70 pct)
Scale:   216440.37 (0.00 pct)      213573.71 (-1.32 pct)
Add:     255062.22 (0.00 pct)      250837.63 (-1.65 pct)
Triad:   265442.03 (0.00 pct)      259851.69 (-2.10 pct)

o NPS4

- 10 Runs:

Test:    tip                       sis_short
Copy:    363927.86 (0.00 pct)      364556.47 (0.17 pct)
Scale:   238190.49 (0.00 pct)      245339.08 (3.00 pct)
Add:     262806.49 (0.00 pct)      270349.31 (2.87 pct)
Triad:   276492.33 (0.00 pct)      274536.47 (-0.70 pct)

- 100 Runs:

Test:    tip                       sis_short
Copy:    369197.55 (0.00 pct)      365775.06 (-0.92 pct)
Scale:   250508.46 (0.00 pct)      251164.01 (0.26 pct)
Add:     267792.19 (0.00 pct)      268477.42 (0.25 pct)
Triad:   280010.98 (0.00 pct)      272448.39 (-2.70 pct)


~~~~~~~~~~~~~~~~
~ ycsb-mongodb ~
~~~~~~~~~~~~~~~~

o NPS1

tip: 131328.67 (var: 2.97%)
sis_short: 130867.33 (var: 3.23%) (-0.35%)

o NPS2

tip: 132819.67 (var: 0.85%)
sis_short: 135295.33 (var: 2.02%) (+1.86%)

o NPS4

tip: 134130.00 (var: 4.12%)
sis_short: 138018.67 (var: 1.92%) (+2.89%)


~~~~~~~~~~~~~~~~~~~~~~~~~~~
~ SPECjbb MultiJVM - NPS1 ~
~~~~~~~~~~~~~~~~~~~~~~~~~~~

Kernel          tip         sis_short
Max-jOPS        100.00%     100.00%
Critical-jOPS   100.00%     95.62%  *

~~~~~~~~~~~~~
~ unixbench ~
~~~~~~~~~~~~~

o NPS1

Test                 Metric   Parallelism              tip                      sis_short
unixbench-dhry2reg   Hmean    unixbench-dhry2reg-1     48518455.50 ( 0.00%)     48698507.20 ( 0.37%)
unixbench-dhry2reg   Hmean    unixbench-dhry2reg-512   6268185467.60 ( 0.00%)   6274727387.00 ( 0.10%)
unixbench-syscall    Amean    unixbench-syscall-1      2685321.17 ( 0.00%)      2701231.47 * 0.59%*
unixbench-syscall    Amean    unixbench-syscall-512    7291476.20 ( 0.00%)      7886916.33 * 8.17%*
unixbench-pipe       Hmean    unixbench-pipe-1         2480858.53 ( 0.00%)      2585426.98 * 4.22%*
unixbench-pipe       Hmean    unixbench-pipe-512       300739256.62 ( 0.00%)    306860971.25 * 2.04%*
unixbench-spawn      Hmean    unixbench-spawn-1        4358.14 ( 0.00%)         4393.55 ( 0.81%)
unixbench-spawn      Hmean    unixbench-spawn-512      76497.32 ( 0.00%)        76175.34 * -0.42%*
unixbench-execl      Hmean    unixbench-execl-1        4147.12 ( 0.00%)         4164.73 * 0.42%*
unixbench-execl      Hmean    unixbench-execl-512      12435.26 ( 0.00%)        12669.59 ( 1.88%)

o NPS2

Test                 Metric   Parallelism              tip                      sis_short
unixbench-dhry2reg   Hmean    unixbench-dhry2reg-1     48872335.50 ( 0.00%)     48880544.40 ( 0.02%)
unixbench-dhry2reg   Hmean    unixbench-dhry2reg-512   6264134378.20 ( 0.00%)   6269767014.50 ( 0.09%)
unixbench-syscall    Amean    unixbench-syscall-1      2683903.13 ( 0.00%)      2698851.60 * 0.56%*
unixbench-syscall    Amean    unixbench-syscall-512    7746773.60 ( 0.00%)      7767971.50 ( 0.27%)
unixbench-pipe       Hmean    unixbench-pipe-1         2476724.23 ( 0.00%)      2587534.80 * 4.47%*
unixbench-pipe       Hmean    unixbench-pipe-512       300277350.41 ( 0.00%)    306600469.40 * 2.11%*
unixbench-spawn      Hmean    unixbench-spawn-1        5026.50 ( 0.00%)         4765.11 ( -5.20%)       *
unixbench-spawn      Hmean    unixbench-spawn-1        4965.80 ( 0.00%)         5283.00 ( 6.40%)        [Verification Run]
unixbench-spawn      Hmean    unixbench-spawn-512      80375.59 ( 0.00%)        79331.12 * -1.30%*
unixbench-execl      Hmean    unixbench-execl-1        4151.70 ( 0.00%)         4139.06 ( -0.30%)
unixbench-execl      Hmean    unixbench-execl-512      13605.15 ( 0.00%)        11898.26 ( -12.55%)     *
unixbench-execl      Hmean    unixbench-execl-512      12617.90 ( 0.00%)        13735.00 ( 8.85%)       [Verification Run]

o NPS4

Test                 Metric   Parallelism              tip                      sis_short
unixbench-dhry2reg   Hmean    unixbench-dhry2reg-1     48506771.20 ( 0.00%)     48886194.50 ( 0.78%)
unixbench-dhry2reg   Hmean    unixbench-dhry2reg-512   6280954362.50 ( 0.00%)   6289332433.50 ( 0.13%)
unixbench-syscall    Amean    unixbench-syscall-1      2687259.30 ( 0.00%)      2700170.93 * 0.48%*
unixbench-syscall    Amean    unixbench-syscall-512    7350275.67 ( 0.00%)      7858736.83 * 6.92%*
unixbench-pipe       Hmean    unixbench-pipe-1         2478893.01 ( 0.00%)      2585741.42 * 4.31%*
unixbench-pipe       Hmean    unixbench-pipe-512       301830155.61 ( 0.00%)    307556537.55 * 1.90%*
unixbench-spawn      Hmean    unixbench-spawn-1        5208.55 ( 0.00%)         5280.85 ( 1.39%)
unixbench-spawn      Hmean    unixbench-spawn-512      80745.79 ( 0.00%)        80411.55 * -0.41%*
unixbench-execl      Hmean    unixbench-execl-1        4072.72 ( 0.00%)         4152.37 * 1.96%*
unixbench-execl      Hmean    unixbench-execl-512      13746.56 ( 0.00%)        13247.30 ( -3.63%)      *
unixbench-execl      Hmean    unixbench-execl-512      13797.40 ( 0.00%)        13624.00 ( -1.25%)      [Verification Run]

> [..snip..]
>>
>> All numbers are with turbo and C2 enabled. I wonder if the
>> check "(5 * nr < 3 * sd->span_weight)" in v2 helped workloads
>> like tbench and SpecJBB. I'll queue some runs with the condition
>> added back and separate run with turbo and C2 disabled to see
>> if they helps. I'll update the thread once the results are in.
> Thanks for helping check if the nr part in v2 could bring the improvement
> back. However Peter seems to have concern regarding the nr check, I'll
> think about it more.
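
For reference, the v2 condition quoted above is just integer
arithmetic for "nr is below 60% of span_weight", written without a
division. What nr counted in v2 is not restated in this mail, so the
sketch below treats it as an opaque count. The function names are
hypothetical and this is plain Python, not kernel code:

```python
def below_v2_threshold(nr: int, span_weight: int) -> bool:
    # Same integer comparison as the quoted check
    # "(5 * nr < 3 * sd->span_weight)", i.e. nr / span_weight < 3/5.
    return 5 * nr < 3 * span_weight


def largest_passing_nr(span_weight: int) -> int:
    # Largest nr for which the check still holds.
    return (3 * span_weight - 1) // 5


# With a hypothetical span of 16 CPUs the check passes up to nr = 9
# and fails from nr = 10 onwards.
print(largest_passing_nr(16))
print(below_v2_threshold(9, 16), below_v2_threshold(10, 16))
```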

SpecJBB Critical-jOPS performance is known to suffer when tasks
queue behind each other. I'll share the data separately. I do see
the average wait_sum go up by 1.3%. The Max-jOPS throughput, however,
is identical on both kernels, which means sis_short does not affect
the overall throughput. Only the critical jobs regress, possibly
because tasks end up queued behind one another.

>
> [..snip..]
>
--
Thanks and Regards,
Prateek