Re: [PATCH v3 0/8] make slab shrink lockless

From: Kirill Tkhai
Date: Mon Feb 27 2023 - 14:47:59 EST


On 27.02.2023 22:32, Roman Gushchin wrote:
> On Mon, Feb 27, 2023 at 10:20:59PM +0300, Kirill Tkhai wrote:
>> On 27.02.2023 18:08, Mike Rapoport wrote:
>>> Hi,
>>>
>>> On Mon, Feb 27, 2023 at 09:31:51PM +0800, Qi Zheng wrote:
>>>>
>>>>
>>>> On 2023/2/27 03:51, Andrew Morton wrote:
>>>>> On Sun, 26 Feb 2023 22:46:47 +0800 Qi Zheng <zhengqi.arch@xxxxxxxxxxxxx> wrote:
>>>>>
>>>>>> Hi all,
>>>>>>
>>>>>> This patch series aims to make slab shrink lockless.
>>>>>
>>>>> What an awesome changelog.
>>>>>
>>>>>> 2. Survey
>>>>>> =========
>>>>>
>>>>> Especially this part.
>>>>>
>>>>> Looking through all the prior efforts and at this patchset I am not
>>>>> immediately seeing any statements about the overall effect upon
>>>>> real-world workloads. For a good example, does this patchset
>>>>> measurably improve throughput or energy consumption on your servers?
>>>>
>>>> Hi Andrew,
>>>>
>>>> I re-tested with the following physical machines:
>>>>
>>>> Architecture: x86_64
>>>> CPU(s): 96
>>>> On-line CPU(s) list: 0-95
>>>> Model name: Intel(R) Xeon(R) Platinum 8260 CPU @ 2.40GHz
>>>>
>>>> I found that the reason for the hotspot I described in cover letter is
>>>> wrong. The reason for the down_read_trylock() hotspot is not because of
>>>> the failure to trylock, but simply because of the atomic operation
>>>> (cmpxchg). And this will lead to a significant reduction in IPC (insn
>>>> per cycle).
>>>
>>> ...
>>>
>>>> Then we can use the following perf command to view hotspots:
>>>>
>>>> perf top -U -F 999
>>>>
>>>> 1) Before applying this patchset:
>>>>
>>>> 32.31% [kernel] [k] down_read_trylock
>>>> 19.40% [kernel] [k] pv_native_safe_halt
>>>> 16.24% [kernel] [k] up_read
>>>> 15.70% [kernel] [k] shrink_slab
>>>> 4.69% [kernel] [k] _find_next_bit
>>>> 2.62% [kernel] [k] shrink_node
>>>> 1.78% [kernel] [k] shrink_lruvec
>>>> 0.76% [kernel] [k] do_shrink_slab
>>>>
>>>> 2) After applying this patchset:
>>>>
>>>> 27.83% [kernel] [k] _find_next_bit
>>>> 16.97% [kernel] [k] shrink_slab
>>>> 15.82% [kernel] [k] pv_native_safe_halt
>>>> 9.58% [kernel] [k] shrink_node
>>>> 8.31% [kernel] [k] shrink_lruvec
>>>> 5.64% [kernel] [k] do_shrink_slab
>>>> 3.88% [kernel] [k] mem_cgroup_iter
>>>>
>>>> 2. At the same time, we use the following perf command to capture IPC
>>>> information:
>>>>
>>>> perf stat -e cycles,instructions -G test -a --repeat 5 -- sleep 10
>>>>
>>>> 1) Before applying this patchset:
>>>>
>>>> Performance counter stats for 'system wide' (5 runs):
>>>>
>>>> 454187219766 cycles test (
>>>> +- 1.84% )
>>>> 78896433101 instructions test # 0.17 insn per
>>>> cycle ( +- 0.44% )
>>>>
>>>> 10.0020430 +- 0.0000366 seconds time elapsed ( +- 0.00% )
>>>>
>>>> 2) After applying this patchset:
>>>>
>>>> Performance counter stats for 'system wide' (5 runs):
>>>>
>>>> 841954709443 cycles test (
>>>> +- 15.80% ) (98.69%)
>>>> 527258677936 instructions test # 0.63 insn per
>>>> cycle ( +- 15.11% ) (98.68%)
>>>>
>>>> 10.01064 +- 0.00831 seconds time elapsed ( +- 0.08% )
>>>>
>>>> We can see that IPC drops very seriously when calling
>>>> down_read_trylock() at high frequency. After using SRCU,
>>>> the IPC is at a normal level.
>>>
>>> The results you present do show improvement in IPC for an artificial test
>>> script. But more interesting would be to see how a real world workloads
>>> benefit from your changes.
>>
>> One of the real workloads from my experience is start of an overcommitted node
>> containing many starting containers after node crash (or many resuming containers
>> after reboot for kernel update). In these cases memory pressure is huge, and
>> the node goes round in long reclaim.
>>
>> This patch patchset makes prealloc_memcg_shrinker() independent of do_shrink_slab(),
>> so prealloc_memcg_shrinker() won't have to wait till shrink_slab_memcg() completes its
>> current bit iteration, sees rwsem_is_contended() and the iteration breaks.
>>
>> Also, it's important to mention that currently we have the strange behavior:
>>
>> prealloc_memcg_shrinker()
>> down_write(&shrinker_rwsem)
>> idr_alloc()
>> reclaim
>> for each child memcg
>> shrink_slab_memcg()
>> down_read_trylock(&shrinker_rwsem) -> fail
>
> But this can happen only if we get -ENOMEM in idr_alloc()?
> Doesn't seem to be a very hot path.

There is not only idr_alloc(), but expand_shrinker_info() too. The last is more heavier.
But despite that, yes, it's not a hot path.

The memory pressure on overcommited node start I described above is a regular situation.
There are lots of register_shrinker() contending with reclaim.