Re: [PATCH v3 0/8] make slab shrink lockless

From: Qi Zheng
Date: Tue Feb 28 2023 - 05:55:29 EST




On 2023/2/28 18:04, Qi Zheng wrote:


On 2023/2/27 23:08, Mike Rapoport wrote:
Hi,

On Mon, Feb 27, 2023 at 09:31:51PM +0800, Qi Zheng wrote:


On 2023/2/27 03:51, Andrew Morton wrote:
On Sun, 26 Feb 2023 22:46:47 +0800 Qi Zheng <zhengqi.arch@xxxxxxxxxxxxx> wrote:

Hi all,

This patch series aims to make slab shrink lockless.

What an awesome changelog.

2. Survey
=========

Especially this part.

Looking through all the prior efforts and at this patchset I am not
immediately seeing any statements about the overall effect upon
real-world workloads.  For a good example, does this patchset
measurably improve throughput or energy consumption on your servers?

Hi Andrew,

I re-tested with the following physical machines:

Architecture:        x86_64
CPU(s):              96
On-line CPU(s) list: 0-95
Model name:          Intel(R) Xeon(R) Platinum 8260 CPU @ 2.40GHz

I found that the explanation of the hotspot in my cover letter was
wrong. The down_read_trylock() hotspot is not caused by trylock
failures, but simply by the atomic operation (cmpxchg) itself, which
leads to a significant reduction in IPC (instructions per cycle).
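
To illustrate why a successful trylock is still expensive, here is a
minimal userspace sketch using C11 atomics. It is not the kernel's rwsem
implementation: the struct, the bit layout and the toy_* names are
hypothetical and only show that every reader performs an atomic
read-modify-write on the same shared word, on lock and again on unlock.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical layout: bit 0 = writer holds the lock, the rest counts readers. */
#define TOY_WRITER_BIT 1L
#define TOY_READER_INC 2L

struct toy_rwsem {
	atomic_long count;	/* one word shared by every CPU */
};

/* Try to take the lock for reading; never blocks. */
static bool toy_down_read_trylock(struct toy_rwsem *sem)
{
	long old = atomic_load_explicit(&sem->count, memory_order_relaxed);

	while (!(old & TOY_WRITER_BIT)) {
		/*
		 * The compare-and-exchange needs the cache line in exclusive
		 * state even when it succeeds, so concurrent readers on many
		 * CPUs keep bouncing the line between them.
		 */
		if (atomic_compare_exchange_weak_explicit(&sem->count, &old,
							  old + TOY_READER_INC,
							  memory_order_acquire,
							  memory_order_relaxed))
			return true;
		/* On failure 'old' now holds the current value; re-check it. */
	}
	return false;	/* a writer holds the lock */
}

/* Drop a read lock: a second atomic RMW on the same shared word. */
static void toy_up_read(struct toy_rwsem *sem)
{
	atomic_fetch_sub_explicit(&sem->count, TOY_READER_INC,
				  memory_order_release);
}
```

In this sketch no reader ever fails the trylock unless a writer is
present, yet each pass still writes the shared word twice, which matches
the observation that the cost comes from the atomic operation rather
than from failed attempts.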

...
Then we can use the following perf command to view hotspots:

perf top -U -F 999

1) Before applying this patchset:

   32.31%  [kernel]           [k] down_read_trylock
   19.40%  [kernel]           [k] pv_native_safe_halt
   16.24%  [kernel]           [k] up_read
   15.70%  [kernel]           [k] shrink_slab
    4.69%  [kernel]           [k] _find_next_bit
    2.62%  [kernel]           [k] shrink_node
    1.78%  [kernel]           [k] shrink_lruvec
    0.76%  [kernel]           [k] do_shrink_slab

2) After applying this patchset:

   27.83%  [kernel]           [k] _find_next_bit
   16.97%  [kernel]           [k] shrink_slab
   15.82%  [kernel]           [k] pv_native_safe_halt
    9.58%  [kernel]           [k] shrink_node
    8.31%  [kernel]           [k] shrink_lruvec
    5.64%  [kernel]           [k] do_shrink_slab
    3.88%  [kernel]           [k] mem_cgroup_iter

2. At the same time, we use the following perf command to capture IPC
information:

perf stat -e cycles,instructions -G test -a --repeat 5 -- sleep 10

1) Before applying this patchset:

  Performance counter stats for 'system wide' (5 runs):

       454187219766      cycles test                    ( +-  1.84% )
        78896433101      instructions              test #    0.17 insn per cycle           ( +-  0.44% )

         10.0020430 +- 0.0000366 seconds time elapsed  ( +-  0.00% )

2) After applying this patchset:

  Performance counter stats for 'system wide' (5 runs):

       841954709443      cycles test                    ( +- 15.80% )  (98.69%)
       527258677936      instructions              test #    0.63 insn per cycle           ( +- 15.11% )  (98.68%)

           10.01064 +- 0.00831 seconds time elapsed  ( +-  0.08% )

We can see that IPC drops sharply (0.17 versus 0.63 insn per cycle)
when down_read_trylock() is called at high frequency. After switching
to SRCU, the IPC is back at a normal level.
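
For reference, the "insn per cycle" column is just instructions divided
by cycles. A small sketch of that arithmetic follows; the helper name
toy_ipc is hypothetical and the inputs are the counter values from the
perf stat output above.

```c
/* IPC = retired instructions / elapsed cycles; 0.0 if no cycles were counted. */
static double toy_ipc(unsigned long long instructions,
		      unsigned long long cycles)
{
	if (cycles == 0)
		return 0.0;
	return (double)instructions / (double)cycles;
}
```

Calling toy_ipc(78896433101ULL, 454187219766ULL) gives roughly 0.17
(before), and toy_ipc(527258677936ULL, 841954709443ULL) gives roughly
0.63 (after), in line with what perf printed.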

The results you present do show improvement in IPC for an artificial test
script. But it would be more interesting to see how real-world workloads
benefit from your changes.

Hi Mike and Andrew,

I did encounter this problem under the real workload of our online
server. At the end of this email, I posted another call stack and
hot spot that I found before.

I scanned the hotspots of all our online servers yesterday and today,
but unfortunately did not find one where the problem is occurring live.

Some of our servers have a large number of containers, and each
container will mount some file systems. This is likely to trigger
down_read_trylock() hotspots when the memory pressure of the whole
machine or the memory pressure of memcg is high.

The servers where this hotspot has occurred (we keep hotspot alarm
records) mostly have 96 cores, 128 cores, or even more.


So yesterday I found a physical server with a configuration similar to
the online servers and ran a simulation test on it. The call stack and
the hotspot in the simulation test are almost exactly the same, so in
theory, when such a hotspot appears on an online server, it should see
the same IPC improvement. This will improve server performance in
memory exhaustion scenarios (memcg or global level).

The above scenario is only one aspect. The other is the lock contention
scenario mentioned by Kirill: after applying this patch set, slab
shrink and register_shrinker() can run completely in parallel, which
fixes that problem.
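
As a rough userspace illustration of registration running in parallel
with list traversal, here is a sketch using C11 atomics. It is not the
SRCU-based code in this series: the toy_* names are hypothetical,
registrations are assumed to be serialized against each other by some
external mutex (not shown), and unregistration, which is where SRCU
grace periods matter, is left out entirely.

```c
#include <stdatomic.h>
#include <stddef.h>

struct toy_shrinker {
	long nr_objects;		/* stand-in for a count callback */
	struct toy_shrinker *next;
};

/* List head; readers load it without taking any lock. */
static _Atomic(struct toy_shrinker *) toy_shrinker_list;

/* Publish a new entry at the head. Writers must be serialized externally. */
static void toy_register_shrinker(struct toy_shrinker *s)
{
	s->next = atomic_load_explicit(&toy_shrinker_list,
				       memory_order_relaxed);
	/* Release store: the entry is fully initialized before readers see it. */
	atomic_store_explicit(&toy_shrinker_list, s, memory_order_release);
}

/* Walk the list with plain loads only: no shared word is written. */
static long toy_shrink_slab(void)
{
	long total = 0;
	struct toy_shrinker *s = atomic_load_explicit(&toy_shrinker_list,
						      memory_order_acquire);

	for (; s; s = s->next)
		total += s->nr_objects;
	return total;
}
```

A walker that starts before a registration simply does not see the new
entry, and one that starts afterwards sees it fully initialized, so
neither side has to wait for the other.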

These are the two main benefits I see for real workloads.

Thanks,
Qi

call stack
----------

@[
    down_read_trylock+1
    shrink_slab+128
    shrink_node+371
    do_try_to_free_pages+232
    try_to_free_pages+243
    __alloc_pages_slowpath+771
    __alloc_pages_nodemask+702
    pagecache_get_page+255
    filemap_fault+1361
    ext4_filemap_fault+44
    __do_fault+76
    handle_mm_fault+3543
    do_user_addr_fault+442
    do_page_fault+48
    page_fault+62
]: 1161690
@[
    down_read_trylock+1
    shrink_slab+128
    shrink_node+371
    balance_pgdat+690
    kswapd+389
    kthread+246
    ret_from_fork+31
]: 8424884
@[
    down_read_trylock+1
    shrink_slab+128
    shrink_node+371
    do_try_to_free_pages+232
    try_to_free_pages+243
    __alloc_pages_slowpath+771
    __alloc_pages_nodemask+702
    __do_page_cache_readahead+244
    filemap_fault+1674
    ext4_filemap_fault+44
    __do_fault+76
    handle_mm_fault+3543
    do_user_addr_fault+442
    do_page_fault+48
    page_fault+62
]: 20917631

hotspot
-------

52.22% [kernel]        [k] down_read_trylock
19.60% [kernel]        [k] up_read
 8.86% [kernel]        [k] shrink_slab
 2.44% [kernel]        [k] idr_find
 1.25% [kernel]        [k] count_shadow_nodes
 1.18% [kernel]        [k] shrink_lruvec
 0.71% [kernel]        [k] mem_cgroup_iter
 0.71% [kernel]        [k] shrink_node
 0.55% [kernel]        [k] find_next_bit


--
Thanks,
Qi