Re: [PATCH v3] sched/fair: Introduce SIS_UTIL to search idle CPU based on sum of util_avg

From: K Prateek Nayak
Date: Mon May 16 2022 - 06:53:09 EST

Next message: Mel Gorman: "Re: [PATCH 0/6] Drain remote per-cpu directly v3"
Previous message: Hsin-Yi Wang: "[PATCH 2/2] squashfs: implement readahead"
Next in thread: Chen Yu: "Re: [PATCH v3] sched/fair: Introduce SIS_UTIL to search idle CPU based on sum of util_avg"

Hello Chenyu,

Thank you for taking a look at the results.

On 5/14/2022 4:25 PM, Chen Yu wrote:
> [..snip..]
> May I know if in all NPS mode, all LLC domains have 16 CPUs?
Yes. The number of CPUs in an LLC domain is always 16 irrespective of the NPS mode.
>> Following is the NUMA configuration for each NPS mode on the system:
>>
>> NPS1: Each socket is a NUMA node.
>>     Total 2 NUMA nodes in the dual socket machine.
>>
>>     Node 0: 0-63,   128-191
>>     Node 1: 64-127, 192-255
>>
>> NPS2: Each socket is further logically divided into 2 NUMA regions.
>>     Total 4 NUMA nodes exist over 2 sockets.
>>
>>     Node 0: 0-31,   128-159
>>     Node 1: 32-63, 160-191
>>     Node 2: 64-95, 192-223
>>     Node 3: 96-127, 224-255
>>
>> NPS4: Each socket is logically divided into 4 NUMA regions.
>>     Total 8 NUMA nodes exist over 2 sockets.
>>
>>     Node 0: 0-15,    128-143
>>     Node 1: 16-31,   144-159
>>     Node 2: 32-47,   160-175
>>     Node 3: 48-63,   176-191
>>     Node 4: 64-79,   192-207
>>     Node 5: 80-95,   208-223
>>     Node 6: 96-111,  224-239
>>     Node 7: 112-127, 240-255
>>
>> Kernel versions:
>> - tip:      5.18-rc1 tip sched/core
>> - SIS_UTIL:    5.18-rc1 tip sched/core + this patch
>>
>> When we began testing, tip was at:
>>
>> commit: a658353167bf "sched/fair: Revise comment about lb decision matrix"
>>
>> Following are the results from the benchmark:
>>
>> * - Data points of concern
>>
>> ~~~~~~~~~
>> hackbench
>> ~~~~~~~~~
>>
>> NPS1
>>
>> Test:                   tip                     SIS_UTIL
>> 1-groups:         4.64 (0.00 pct)         4.70 (-1.29 pct)
>> 2-groups:         5.38 (0.00 pct)         5.45 (-1.30 pct)
>> 4-groups:         6.15 (0.00 pct)         6.10 (0.81 pct)
>> 8-groups:         7.42 (0.00 pct)         7.42 (0.00 pct)
>> 16-groups:        10.70 (0.00 pct)        11.69 (-9.25 pct) *
>>
>> NPS2
>>
>> Test:                   tip                     SIS_UTIL
>> 1-groups:         4.70 (0.00 pct)         4.70 (0.00 pct)
>> 2-groups:         5.45 (0.00 pct)         5.46 (-0.18 pct)
>> 4-groups:         6.13 (0.00 pct)         6.05 (1.30 pct)
>> 8-groups:         7.30 (0.00 pct)         7.05 (3.42 pct)
>> 16-groups:        10.30 (0.00 pct)        10.12 (1.74 pct)
>>
>> NPS4
>>
>> Test:                   tip                     SIS_UTIL
>> 1-groups:         4.60 (0.00 pct)         4.75 (-3.26 pct) *
>> 2-groups:         5.41 (0.00 pct)         5.42 (-0.18 pct)
>> 4-groups:         6.12 (0.00 pct)         6.00 (1.96 pct)
>> 8-groups:         7.22 (0.00 pct)         7.10 (1.66 pct)
>> 16-groups:        10.24 (0.00 pct)        10.11 (1.26 pct)
>>
>> ~~~~~~~~
>> schbench
>> ~~~~~~~~
>>
>> NPS 1
>>
>> #workers:   tip                     SIS_UTIL
>> 1:      29.00 (0.00 pct)        21.00 (27.58 pct)
>> 2:      28.00 (0.00 pct)        28.00 (0.00 pct)
>> 4:      31.50 (0.00 pct)        31.00 (1.58 pct)
>> 8:      42.00 (0.00 pct)        39.00 (7.14 pct)
>> 16:      56.50 (0.00 pct)        54.50 (3.53 pct)
>> 32:      94.50 (0.00 pct)        94.00 (0.52 pct)
>> 64:     176.00 (0.00 pct)       175.00 (0.56 pct)
>> 128:     404.00 (0.00 pct)       394.00 (2.47 pct)
>> 256:     869.00 (0.00 pct)       863.00 (0.69 pct)
>> 512:     58432.00 (0.00 pct)     55424.00 (5.14 pct)
>>
>> NPS2
>>
>> #workers:      tip                     SIS_UTIL
>> 1:      26.50 (0.00 pct)        25.00 (5.66 pct)
>> 2:      26.50 (0.00 pct)        25.50 (3.77 pct)
>> 4:      34.50 (0.00 pct)        34.00 (1.44 pct)
>> 8:      45.00 (0.00 pct)        46.00 (-2.22 pct)
>> 16:      56.50 (0.00 pct)        60.50 (-7.07 pct)        *
>> 32:      95.50 (0.00 pct)        93.00 (2.61 pct)
>> 64:     179.00 (0.00 pct)       179.00 (0.00 pct)
>> 128:     369.00 (0.00 pct)       376.00 (-1.89 pct)
>> 256:     898.00 (0.00 pct)       903.00 (-0.55 pct)
>> 512:     56256.00 (0.00 pct)     57088.00 (-1.47 pct)
>>
>> NPS4
>>
>> #workers:    tip                     SIS_UTIL
>> 1:      25.00 (0.00 pct)        21.00 (16.00 pct)
>> 2:      28.00 (0.00 pct)        24.00 (14.28 pct)
>> 4:      29.50 (0.00 pct)        29.50 (0.00 pct)
>> 8:      41.00 (0.00 pct)        37.50 (8.53 pct)
>> 16:      65.50 (0.00 pct)        64.00 (2.29 pct)
>> 32:      93.00 (0.00 pct)        94.50 (-1.61 pct)
>> 64:     170.50 (0.00 pct)       175.50 (-2.93 pct)
>> 128:     377.00 (0.00 pct)       368.50 (2.25 pct)
>> 256:     867.00 (0.00 pct)       902.00 (-4.03 pct)
>> 512:     58048.00 (0.00 pct)     55488.00 (4.41 pct)
>>
>> ~~~~~~
>> tbench
>> ~~~~~~
>>
>> NPS 1
>>
>> Clients:     tip                     SIS_UTIL
>>     1    443.31 (0.00 pct)       456.19 (2.90 pct)
>>     2    877.32 (0.00 pct)       875.24 (-0.23 pct)
>>     4    1665.11 (0.00 pct)      1647.31 (-1.06 pct)
>>     8    3016.68 (0.00 pct)      2993.23 (-0.77 pct)
>>    16    5374.30 (0.00 pct)      5246.93 (-2.36 pct)
>>    32    8763.86 (0.00 pct)      7878.18 (-10.10 pct)     *
>>    64    15786.93 (0.00 pct)     12958.47 (-17.91 pct)    *
>> 128    26826.08 (0.00 pct)     26741.14 (-0.31 pct)
>> 256    24207.35 (0.00 pct)     52041.89 (114.98 pct)
>> 512    51740.58 (0.00 pct)     52084.44 (0.66 pct)
>> 1024    51177.82 (0.00 pct)     53126.29 (3.80 pct)
>>
>> NPS 2
>>
>> Clients:     tip                     SIS_UTIL
>>     1    449.49 (0.00 pct)       447.96 (-0.34 pct)
>>     2    867.28 (0.00 pct)       869.52 (0.25 pct)
>>     4    1643.60 (0.00 pct)      1625.91 (-1.07 pct)
>>     8    3047.35 (0.00 pct)      2952.82 (-3.10 pct)
>>    16    5340.77 (0.00 pct)      5251.41 (-1.67 pct)
>>    32    10536.85 (0.00 pct)     8843.49 (-16.07 pct)     *
>>    64    16543.23 (0.00 pct)     14265.35 (-13.76 pct)    *
>> 128    26400.40 (0.00 pct)     25595.42 (-3.04 pct)
>> 256    23436.75 (0.00 pct)     47090.03 (100.92 pct)
>> 512    50902.60 (0.00 pct)     50036.58 (-1.70 pct)
>> 1024    50216.10 (0.00 pct)     50639.74 (0.84 pct)
>>
>> NPS 4
>>
>> Clients:     tip                     SIS_UTIL
>>     1    443.82 (0.00 pct)       459.93 (3.62 pct)
>>     2    849.14 (0.00 pct)       882.17 (3.88 pct)
>>     4    1603.26 (0.00 pct)      1629.64 (1.64 pct)
>>     8    2972.37 (0.00 pct)      3003.09 (1.03 pct)
>>    16    5277.13 (0.00 pct)      5234.07 (-0.81 pct)
>>    32    9744.73 (0.00 pct)      9347.90 (-4.07 pct)      *
>>    64    15854.80 (0.00 pct)     14180.27 (-10.56 pct)    *
>> 128    26116.97 (0.00 pct)     24597.45 (-5.81 pct)     *
>> 256    22403.25 (0.00 pct)     47385.09 (111.50 pct)
>> 512    48317.20 (0.00 pct)     49781.02 (3.02 pct)
>> 1024    50445.41 (0.00 pct)     51607.53 (2.30 pct)
>>
>> ~~~~~~
>> Stream
>> ~~~~~~
>>
>> - 10 runs
>>
>> NPS1
>>
>>               tip                     SIS_UTIL
>> Copy:   189113.11 (0.00 pct)    188490.27 (-0.32 pct)
>> Scale:   201190.61 (0.00 pct)    204526.15 (1.65 pct)
>> Add:   232654.21 (0.00 pct)    234948.01 (0.98 pct)
>> Triad:   226583.57 (0.00 pct)    228844.43 (0.99 pct)
>>
>> NPS2
>>
>> Test:         tip                     SIS_UTIL
>> Copy:   155347.14 (0.00 pct)    169386.29 (9.03 pct)
>> Scale:   191701.53 (0.00 pct)    196110.51 (2.29 pct)
>> Add:   210013.97 (0.00 pct)    221088.45 (5.27 pct)
>> Triad:   207602.00 (0.00 pct)    218072.52 (5.04 pct)
>>
>> NPS4
>>
>> Test:         tip                     SIS_UTIL
>> Copy:   136421.15 (0.00 pct)    140894.11 (3.27 pct)
>> Scale:   191217.59 (0.00 pct)    190554.17 (-0.34 pct)
>> Add:   189229.52 (0.00 pct)    190871.88 (0.86 pct)
>> Triad:   188052.99 (0.00 pct)    188417.63 (0.19 pct)
>>
>> - 100 runs
>>
>> NPS1
>>
>> Test:       tip                     SIS_UTIL
>> Copy:   244693.32 (0.00 pct)    232328.05 (-5.05 pct)
>> Scale:   221874.99 (0.00 pct)    216858.39 (-2.26 pct)
>> Add:   268363.89 (0.00 pct)    265449.16 (-1.08 pct)
>> Triad:   260945.24 (0.00 pct)    252240.56 (-3.33 pct)
>>
>> NPS2
>>
>> Test:       tip                     SIS_UTIL
>> Copy:   211262.00 (0.00 pct)    225240.03 (6.61 pct)
>> Scale:   222493.34 (0.00 pct)    219094.65 (-1.52 pct)
>> Add:   280277.17 (0.00 pct)    275677.73 (-1.64 pct)
>> Triad:   265860.49 (0.00 pct)    262584.22 (-1.23 pct)
>>
>> NPS4
>>
>> Test:       tip                     SIS_UTIL
>> Copy:   250171.40 (0.00 pct)    230983.60 (-7.66 pct)
>> Scale:   222293.56 (0.00 pct)    215984.34 (-2.83 pct)
>> Add:   279222.16 (0.00 pct)    270402.64 (-3.15 pct)
>> Triad:   262013.92 (0.00 pct)    254820.60 (-2.74 pct)
>>
>> ~~~~~~~~~~~~
>> ycsb-mongodb
>> ~~~~~~~~~~~~
>>
>> NPS1
>>
>> sched-tip:      303718.33 (var: 1.31)
>> SIS_UTIL:       303529.33 (var: 0.67)    (-0.06%)
>>
>> NPS2
>>
>> sched-tip:      304536.33 (var: 2.46)
>> SIS_UTIL:       303730.33 (var: 1.57)    (-0.26%)
>>
>> NPS4
>>
>> sched-tip:      301192.33 (var: 1.81)
>> SIS_UTIL:       300101.33 (var: 0.35)   (-0.36%)
>>
>> ~~~~~~~~~~~~~~~~~~
>>
>> Notes:
>>
>> - There seems to be some noticeable regression for hackbench
>> with 16 groups in NPS1 mode.
> Did the hackbench use the default fd number(20) in every group? If
> this is the case, then there are 16 * 20 * 2 = 640 threads in the
> system. I thought this should be overloaded, either in SIS_PROP or
> SIS_UTIL, the search depth might be 4 and 0 respectively. And it
> is also very likely the SIS_PROP will not find an idle CPU after
> searching for 4 CPUs. So in theory there should be not much performance
> difference with vs without the patch applied. But if the fd number is set
> to a smaller one, the regression could be explained as you mentioned,
> SIS_PROP searches more aggressively.
Yes, I'm using fd number(20). The logs from hackbench run show that it is
indeed running 640 threads with 16 groups:

# Running 'sched/messaging' benchmark:
# 20 sender and receiver threads per group
# 16 groups == 640 threads run
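
For reference, the 640 figure follows directly from the group and fd counts. Below is a minimal sketch in C. Both helpers are hypothetical (they are not part of hackbench or perf), and the sender/receiver split per fd is assumed from the log above:

```c
/*
 * Hypothetical helper, not taken from hackbench or perf: each group is
 * assumed to run nr_fds sender threads plus nr_fds receiver threads.
 */
unsigned int hackbench_threads(unsigned int groups, unsigned int nr_fds)
{
	return groups * nr_fds * 2;
}

/* Returns 1 when there are more runnable threads than CPUs, else 0. */
int more_threads_than_cpus(unsigned int threads, unsigned int nr_cpus)
{
	return threads > nr_cpus;
}
```

With 16 groups and 20 fds this gives 640 threads, which is 2.5 times the 256 CPUs of the test machine.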

This is indeed counterintuitive and I don't have
a good explanation for this other than that SIS_PROP
is probably slightly more successful at finding
an idle CPU even in this overloaded environment.

I've run the benchmark in two sets of 3 runs, rebooting
in between, on each kernel version:

- tip

Test:                   tip-r0                  tip-r1                  tip-r2
1-groups:         4.64 (0.00 pct)         4.90 (-5.60 pct)        4.99 (-7.54 pct)
2-groups:         5.54 (0.00 pct)         5.56 (-0.36 pct)        5.58 (-0.72 pct)
4-groups:         6.24 (0.00 pct)         6.18 (0.96 pct)         6.20 (0.64 pct)
8-groups:         7.54 (0.00 pct)         7.50 (0.53 pct)         7.54 (0.00 pct)
16-groups:        10.85 (0.00 pct)        11.17 (-2.94 pct)       10.91 (-0.55 pct)

Test:                   tip-r3                  tip-r4                  tip-r5
1-groups:         4.68 (0.00 pct)         4.97 (-6.19 pct)        4.98 (-6.41 pct)
2-groups:         5.60 (0.00 pct)         5.62 (-0.35 pct)        5.66 (-1.07 pct)
4-groups:         6.24 (0.00 pct)         6.23 (0.16 pct)         6.24 (0.00 pct)
8-groups:         7.54 (0.00 pct)         7.50 (0.53 pct)         7.46 (1.06 pct)
16-groups:        10.81 (0.00 pct)        10.84 (-0.27 pct)       10.81 (0.00 pct)

- SIS_UTIL

Test:                SIS_UTIL-r0              SIS_UTIL-r1             SIS_UTIL-r2
1-groups:         4.68 (0.00 pct)         5.03 (-7.47 pct)        4.96 (-5.98 pct)
2-groups:         5.45 (0.00 pct)         5.48 (-0.55 pct)        5.50 (-0.91 pct)
4-groups:         6.10 (0.00 pct)         6.07 (0.49 pct)         6.14 (-0.65 pct)
8-groups:         7.52 (0.00 pct)         7.51 (0.13 pct)         7.52 (0.00 pct)
16-groups:        11.63 (0.00 pct)        11.48 (1.28 pct)        11.51 (1.03 pct)

Test:                SIS_UTIL-r3              SIS_UTIL-r4             SIS_UTIL-r5
1-groups:         4.80 (0.00 pct)         5.00 (-4.16 pct)        5.06 (-5.41 pct)
2-groups:         5.51 (0.00 pct)         5.58 (-1.27 pct)        5.58 (-1.27 pct)
4-groups:         6.14 (0.00 pct)         6.11 (0.48 pct)         6.06 (1.30 pct)
8-groups:         7.35 (0.00 pct)         7.38 (-0.40 pct)        7.40 (-0.68 pct)
16-groups:        11.03 (0.00 pct)        11.29 (-2.35 pct)       11.14 (-0.99 pct)

- Comparing the good and bad data points for 16-groups with each
kernel version:

Test:                   tip-good             SIS_UTIL-good
1-groups:         4.68 (0.00 pct)         4.80 (-2.56 pct)
2-groups:         5.60 (0.00 pct)         5.51 (1.60 pct)
4-groups:         6.24 (0.00 pct)         6.14 (1.60 pct)
8-groups:         7.54 (0.00 pct)         7.35 (2.51 pct)
16-groups:        10.81 (0.00 pct)        11.03 (-2.03 pct)

Test:                   tip-good             SIS_UTIL-bad
1-groups:         4.68 (0.00 pct)         4.68 (0.00 pct)
2-groups:         5.60 (0.00 pct)         5.45 (2.67 pct)
4-groups:         6.24 (0.00 pct)         6.10 (2.24 pct)
8-groups:         7.54 (0.00 pct)         7.52 (0.26 pct)
16-groups:        10.81 (0.00 pct)        11.63 (-7.58 pct)

Test:                   tip-bad             SIS_UTIL-good
1-groups:         4.90 (0.00 pct)         4.80 (2.04 pct)
2-groups:         5.56 (0.00 pct)         5.51 (0.89 pct)
4-groups:         6.18 (0.00 pct)         6.14 (0.64 pct)
8-groups:         7.50 (0.00 pct)         7.35 (2.00 pct)
16-groups:        11.17 (0.00 pct)        11.03 (1.25 pct)

Test:                   tip-bad             SIS_UTIL-bad
1-groups:         4.90 (0.00 pct)         4.68 (4.48 pct)
2-groups:         5.56 (0.00 pct)         5.45 (1.97 pct)
4-groups:         6.18 (0.00 pct)         6.10 (1.29 pct)
8-groups:         7.50 (0.00 pct)         7.52 (-0.26 pct)
16-groups:        11.17 (0.00 pct)        11.63 (-4.11 pct)

Hackbench consistently reports > 11 for the 16-group
case with SIS_UTIL, but only once with SIS_PROP.

>> - There seems to be regression in tbench for case with number
>> of workers in range 32-128 (12.5% loaded to 50% loaded)
>> - tbench reaches saturation early when system is fully loaded
>>
>> This probably shows that the strategy in the initial v1 RFC
>> seems to work better with our LLC where number of CPUs per LLC
>> is low compared to systems with unified LLC. Given this is
>> showing great results for unified LLC, maybe SIS_PROP and SIS_UTIL
>> can be enabled based on the size of LLC.
>>
> Yes, SIS_PROP searches more aggressively, but we are attempting to replace
> SIS_PROP with a more accurate policy.
>>> [..snip..]
>>>
>>> [3]
>>> Prateek mentioned that we should scan aggressively in an LLC domain
>>> with 16 CPUs. Because the cost to search for an idle one among 16 CPUs is
>>> negligible. The current patch aims to propose a generic solution and only
>>> considers the util_avg. A follow-up change could enhance the scan policy
>>> to adjust the scan_percent according to the CPU number in LLC.
>> Following are some additional numbers I would like to share comparing SIS_PROP and
>> SIS_UTIL:
>>
> Nice analysis.
>> o Hackbench with 1 group
>>
>> With 1 group, following are the chances of SIS_PROP
>> and SIS_UTIL finding an idle CPU when an idle CPU
>> exists in LLC:
>>
>> +-----------------+---------------------------+---------------------------+--------+
>> | Idle CPU in LLC | SIS_PROP able to find CPU | SIS_UTIL able to find CPU | Count  |
>> +-----------------+---------------------------+---------------------------+--------+
>> |        1        |             0             |             0             |  66444 |
>> |        1        |             0             |             1             |  34153 |
>> |        1        |             1             |             0             |  57204 |
>> |        1        |             1             |             1             | 119263 |
>> +-----------------+---------------------------+---------------------------+--------+
>>
> So SIS_PROP searches more, and gets a higher chance of finding an idle CPU in an LLC with
> 16 CPUs.
Yes!
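
To put a number on it, the table above can be reduced to a hit rate per policy. This is only a sketch: the struct and helper names are hypothetical, and the counts are copied from the 1-group table.

```c
#include <stdint.h>

/*
 * Outcome counts for wakeups where an idle CPU existed in the LLC,
 * indexed as count[SIS_PROP found][SIS_UTIL found]. The struct and the
 * helper names are hypothetical, used only for this illustration.
 */
struct sis_outcomes {
	uint64_t count[2][2];
};

/* Counts copied from the 1-group hackbench table quoted above. */
const struct sis_outcomes hackbench_1_group = {
	.count = { { 66444, 34153 }, { 57204, 119263 } },
};

uint64_t sis_total(const struct sis_outcomes *o)
{
	return o->count[0][0] + o->count[0][1] +
	       o->count[1][0] + o->count[1][1];
}

/* Share of those wakeups (in percent) where SIS_PROP found the idle CPU. */
double sis_prop_hit_pct(const struct sis_outcomes *o)
{
	return 100.0 * (double)(o->count[1][0] + o->count[1][1]) /
	       (double)sis_total(o);
}

/* Share of those wakeups (in percent) where SIS_UTIL found the idle CPU. */
double sis_util_hit_pct(const struct sis_outcomes *o)
{
	return 100.0 * (double)(o->count[0][1] + o->count[1][1]) /
	       (double)sis_total(o);
}
```

For the 1-group case this works out to roughly 63.7 pct for SIS_PROP against 55.4 pct for SIS_UTIL, out of 277064 samples.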
>> SIS_PROP vs no SIS_PROP CPU search stats:
>>
>> Total time without SIS_PROP: 90653653
>> Total time with SIS_PROP: 53558942 (-40.92 pct)
>> Total time saved: 37094711
>>
> What does no SIS_PROP mean? Is it with SIS_PROP disabled and
> SIS_UTIL enabled? Or with both SIS_PROP and SIS_UTIL disabled?
> If it is the latter, is there any performance difference between
> the two?

Sorry for not being clear here. No SIS_PROP means we are searching the
entire LLC all the time for an idle CPU. This data aims to find how much
time SIS_PROP is saving compared to the case where it is disabled.
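
The percentage next to "Total time with SIS_PROP" is the relative change against the full-LLC search. A minimal sketch of that arithmetic follows; `search_time_delta_pct()` is a hypothetical helper and the totals are treated as plain counters:

```c
#include <stdint.h>

/*
 * Hypothetical helper: relative change in total search time, in percent.
 * A negative value means the SIS_PROP-limited search spent less time
 * than searching the entire LLC on every wakeup.
 */
double search_time_delta_pct(uint64_t full_llc_time, uint64_t sis_prop_time)
{
	return ((double)sis_prop_time - (double)full_llc_time) * 100.0 /
	       (double)full_llc_time;
}
```

Feeding in the hackbench 1-group totals (90653653 and 53558942) gives about -40.92 pct, matching the figure reported above.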

>> Following are number of CPUs SIS_UTIL will search when SIS_PROP limit >= 16 (LLC size):
>>
>> +--------------+-------+
>> | CPU Searched | Count |
>> +--------------+-------+
>> |      0       | 10520 |
>> |      1       |  7770 |
>> |      2       | 11976 |
>> |      3       | 17554 |
>> |      4       | 13932 |
>> |      5       | 15051 |
>> |      6       |  8398 |
>> |      7       |  4544 |
>> |      8       |  3712 |
>> |      9       |  2337 |
>> |      10      |  4541 |
>> |      11      |  1947 |
>> |      12      |  3846 |
>> |      13      |  3645 |
>> |      14      |  2686 |
>> |      15      |  8390 |
>> |      16      | 26157 |
>> +--------------+-------+
>>
>> - SIS_UTIL might be bailing out too early in some of these cases.
>>
> Right.
>> o Hackbench with 16 group
>>
>> the success rate looks as follows:
>>
>> +-----------------+---------------------------+---------------------------+---------+
>> | Idle CPU in LLC | SIS_PROP able to find CPU | SIS_UTIL able to find CPU | Count   |
>> +-----------------+---------------------------+---------------------------+---------+
>> |        1        |             0             |             0             | 1313745 |
>> |        1        |             0             |             1             |  694132 |
>> |        1        |             1             |             0             | 2888450 |
>> |        1        |             1             |             1             | 5343065 |
>> +-----------------+---------------------------+---------------------------+---------+
>>
>> SIS_PROP vs no SIS_PROP CPU search stats:
>>
>> Total time without SIS_PROP: 5227299388
>> Total time with SIS_PROP: 3866575188 (-26.03 pct)
>> Total time saved: 1360724200
>>
>> Following are number of CPUs SIS_UTIL will search when SIS_PROP limit >= 16 (LLC size):
>>
>> +--------------+---------+
>> | CPU Searched | Count   |
>> +--------------+---------+
>> |      0       |  150351 |
>> |      1       |  105116 |
>> |      2       |  214291 |
>> |      3       |  440053 |
>> |      4       |  914116 |
>> |      5       | 1757984 |
>> |      6       | 2410484 |
>> |      7       | 1867668 |
>> |      8       |  379888 |
>> |      9       |   84055 |
>> |      10      |   55389 |
>> |      11      |   26795 |
>> |      12      |   43113 |
>> |      13      |   24579 |
>> |      14      |   32896 |
>> |      15      |   70059 |
>> |      16      |  150858 |
>> +--------------+---------+
>>
>> - SIS_UTIL might be bailing out too early in most of these cases
>>
> It might be interesting to see what the current sum of util_avg is, and this suggests that,
> even if util_avg is a little high, it might still be worthwhile to search more CPUs.
I agree. Let me know if there is any data you would like me to collect wrt this.
>> o tbench with 256 workers
>>
>> For tbench with 256 threads, SIS_UTIL works great as we have drastically cut down the number
>> of CPUs to search.
>>
>> SIS_PROP vs no SIS_PROP CPU search stats:
>>
>> Total time without SIS_PROP: 64004752959
>> Total time with SIS_PROP: 34695004390 (-45.79 pct)
>> Total time saved: 29309748569
>>
>> Following are number of CPUs SIS_UTIL will search when SIS_PROP limit >= 16 (LLC size):
>>
>> +--------------+----------+
>> | CPU Searched | Count    |
>> +--------------+----------+
>> |      0       |   500077 |
>> |      1       |   543865 |
>> |      2       |  4257684 |
>> |      3       | 27457498 |
>> |      4       | 40208673 |
>> |      5       |  3264358 |
>> |      6       |   191631 |
>> |      7       |    24658 |
>> |      8       |     2469 |
>> |      9       |     1374 |
>> |      10      |     2008 |
>> |      11      |     1300 |
>> |      12      |     1226 |
>> |      13      |     1179 |
>> |      14      |     1631 |
>> |      15      |    11678 |
>> |      16      |     7793 |
>> +--------------+----------+
>>
>> - This is where SIS_UTIL shines for tbench case with 256 workers as it is effective
>> at restricting search space well.
>>
>> o Observations
>>
>> SIS_PROP seems to have a higher chance of finding an idle CPU compared to SIS_UTIL
>> in case of hackbench with 16-group. The gap between SIS_PROP and SIS_UTIL is wider
>> with 16 groups than with 1 group.
>> Also SIS_PROP is more aggressive at saving time for 1-group compared to the
>> case with 16-groups.
>>
>> The bailout from SIS_UTIL is fruitful for tbench with 256 workers leading to massive
>> performance gain in a fully loaded system.
>>
>> Note: There might be some inaccuracies for the numbers presented for metrics that
>> directly compare SIS_PROP and SIS_UTIL as both SIS_PROP and SIS_UTIL were enabled
>> when gathering these data points and the results from SIS_PROP were returned from
>> search_idle_cpu().
> Do you mean the 'CPU Searched' calculated by SIS_PROP was collected with both SIS_UTIL
> and SIS_PROP enabled?
Yes, the table
"Number of CPUs SIS_UTIL will search when SIS_PROP limit >= 16 (LLC size)"
was obtained by enabling both the features - SIS_PROP and SIS_UTIL, and
comparing the nr values suggested by SIS_UTIL when SIS_PROP allowed
searching the entire LLC.
>> All the numbers for the above analysis were gathered in NPS1 mode.
>>
> I'm thinking of taking nr_llc number into consideration to adjust the search depth,
> something like:
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index dd52fc5a034b..39b914599dce 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -9302,6 +9302,9 @@ static inline void update_idle_cpu_scan(struct lb_env *env,
> llc_util_pct = (sum_util * 100) / (nr_llc * SCHED_CAPACITY_SCALE);
> nr_scan = (100 - (llc_util_pct * llc_util_pct / 72)) * nr_llc / 100;
> nr_scan = max(nr_scan, 0);
> + if (nr_llc <= 16 && nr_scan)
> + nr_scan = nr_llc;
> +
This will behave closer to the initial RFC on systems with smaller LLC.
I can do some preliminary testing with this and get back to you.
> WRITE_ONCE(sd_share->nr_idle_scan, nr_scan);
> }
>
> I'll offline the CPUs to make it 16 CPUs per LLC, and check how hackbench behaves.
Thank you for looking into this.
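
The effect of the proposed hunk on a 16-CPU LLC can be checked with a small userspace model. This is only a sketch: `model_nr_idle_scan()` and its `small_llc_tweak` flag are hypothetical names, the integer types are assumed, and `SCHED_CAPACITY_SCALE` is hard-coded to the kernel's usual value of 1024.

```c
/* Assumed to match the kernel's definition (1 << 10). */
#define SCHED_CAPACITY_SCALE 1024

/*
 * Userspace model of the nr_scan computation quoted in the diff above.
 * sum_util is the summed util_avg of the LLC, nr_llc the number of CPUs
 * in it. small_llc_tweak toggles the proposed "nr_llc <= 16" hunk so the
 * two behaviours can be compared; the flag is not part of the patch.
 */
int model_nr_idle_scan(unsigned long sum_util, int nr_llc, int small_llc_tweak)
{
	int llc_util_pct, nr_scan;

	llc_util_pct = (int)((sum_util * 100) /
			     ((unsigned long)nr_llc * SCHED_CAPACITY_SCALE));
	nr_scan = (100 - (llc_util_pct * llc_util_pct / 72)) * nr_llc / 100;
	if (nr_scan < 0)	/* stands in for max(nr_scan, 0) */
		nr_scan = 0;

	/* Proposed change: scan the whole LLC when it is small. */
	if (small_llc_tweak && nr_llc <= 16 && nr_scan)
		nr_scan = nr_llc;

	return nr_scan;
}
```

Under these assumptions a 16-CPU LLC at 50 pct utilization would scan 10 CPUs without the tweak and all 16 with it, while nr_scan still drops to 0 once utilization reaches about 83 pct.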

--
Thanks and Regards,
Prateek
