Re: [PATCH v3 0/8] make slab shrink lockless

From: Roman Gushchin
Date: Mon Feb 27 2023 - 14:32:48 EST


On Mon, Feb 27, 2023 at 10:20:59PM +0300, Kirill Tkhai wrote:
> On 27.02.2023 18:08, Mike Rapoport wrote:
> > Hi,
> >
> > On Mon, Feb 27, 2023 at 09:31:51PM +0800, Qi Zheng wrote:
> >>
> >>
> >> On 2023/2/27 03:51, Andrew Morton wrote:
> >>> On Sun, 26 Feb 2023 22:46:47 +0800 Qi Zheng <zhengqi.arch@xxxxxxxxxxxxx> wrote:
> >>>
> >>>> Hi all,
> >>>>
> >>>> This patch series aims to make slab shrink lockless.
> >>>
> >>> What an awesome changelog.
> >>>
> >>>> 2. Survey
> >>>> =========
> >>>
> >>> Especially this part.
> >>>
> >>> Looking through all the prior efforts and at this patchset I am not
> >>> immediately seeing any statements about the overall effect upon
> >>> real-world workloads. For a good example, does this patchset
> >>> measurably improve throughput or energy consumption on your servers?
> >>
> >> Hi Andrew,
> >>
> >> I re-tested with the following physical machines:
> >>
> >> Architecture: x86_64
> >> CPU(s): 96
> >> On-line CPU(s) list: 0-95
> >> Model name: Intel(R) Xeon(R) Platinum 8260 CPU @ 2.40GHz
> >>
> >> I found that the reason for the hotspot I described in cover letter is
> >> wrong. The reason for the down_read_trylock() hotspot is not because of
> >> the failure to trylock, but simply because of the atomic operation
> >> (cmpxchg). And this will lead to a significant reduction in IPC (insn
> >> per cycle).
> >
> > ...
> >
> >> Then we can use the following perf command to view hotspots:
> >>
> >> perf top -U -F 999
> >>
> >> 1) Before applying this patchset:
> >>
> >> 32.31% [kernel] [k] down_read_trylock
> >> 19.40% [kernel] [k] pv_native_safe_halt
> >> 16.24% [kernel] [k] up_read
> >> 15.70% [kernel] [k] shrink_slab
> >> 4.69% [kernel] [k] _find_next_bit
> >> 2.62% [kernel] [k] shrink_node
> >> 1.78% [kernel] [k] shrink_lruvec
> >> 0.76% [kernel] [k] do_shrink_slab
> >>
> >> 2) After applying this patchset:
> >>
> >> 27.83% [kernel] [k] _find_next_bit
> >> 16.97% [kernel] [k] shrink_slab
> >> 15.82% [kernel] [k] pv_native_safe_halt
> >> 9.58% [kernel] [k] shrink_node
> >> 8.31% [kernel] [k] shrink_lruvec
> >> 5.64% [kernel] [k] do_shrink_slab
> >> 3.88% [kernel] [k] mem_cgroup_iter
> >>
> >> 2. At the same time, we use the following perf command to capture IPC
> >> information:
> >>
> >> perf stat -e cycles,instructions -G test -a --repeat 5 -- sleep 10
> >>
> >> 1) Before applying this patchset:
> >>
> >> Performance counter stats for 'system wide' (5 runs):
> >>
> >> 454187219766 cycles test (
> >> +- 1.84% )
> >> 78896433101 instructions test # 0.17 insn per
> >> cycle ( +- 0.44% )
> >>
> >> 10.0020430 +- 0.0000366 seconds time elapsed ( +- 0.00% )
> >>
> >> 2) After applying this patchset:
> >>
> >> Performance counter stats for 'system wide' (5 runs):
> >>
> >> 841954709443 cycles test (
> >> +- 15.80% ) (98.69%)
> >> 527258677936 instructions test # 0.63 insn per
> >> cycle ( +- 15.11% ) (98.68%)
> >>
> >> 10.01064 +- 0.00831 seconds time elapsed ( +- 0.08% )
> >>
> >> We can see that IPC drops very seriously when calling
> >> down_read_trylock() at high frequency. After using SRCU,
> >> the IPC is at a normal level.
> >
> > The results you present do show improvement in IPC for an artificial test
> > script. But more interesting would be to see how a real world workloads
> > benefit from your changes.
>
> One of the real workloads from my experience is start of an overcommitted node
> containing many starting containers after node crash (or many resuming containers
> after reboot for kernel update). In these cases memory pressure is huge, and
> the node goes round in long reclaim.
>
> This patch patchset makes prealloc_memcg_shrinker() independent of do_shrink_slab(),
> so prealloc_memcg_shrinker() won't have to wait till shrink_slab_memcg() completes its
> current bit iteration, sees rwsem_is_contended() and the iteration breaks.
>
> Also, it's important to mention that currently we have the strange behavior:
>
> prealloc_memcg_shrinker()
> down_write(&shrinker_rwsem)
> idr_alloc()
> reclaim
> for each child memcg
> shrink_slab_memcg()
> down_read_trylock(&shrinker_rwsem) -> fail

But this can happen only if we get -ENOMEM in idr_alloc()?
Doesn't seem to be a very hot path.

Thanks!