Re: [PATCH v3 0/8] make slab shrink lockless

From: Mike Rapoport
Date: Mon Feb 27 2023 - 10:08:50 EST


Hi,

On Mon, Feb 27, 2023 at 09:31:51PM +0800, Qi Zheng wrote:
>
>
> On 2023/2/27 03:51, Andrew Morton wrote:
> > On Sun, 26 Feb 2023 22:46:47 +0800 Qi Zheng <zhengqi.arch@xxxxxxxxxxxxx> wrote:
> >
> > > Hi all,
> > >
> > > This patch series aims to make slab shrink lockless.
> >
> > What an awesome changelog.
> >
> > > 2. Survey
> > > =========
> >
> > Especially this part.
> >
> > Looking through all the prior efforts and at this patchset I am not
> > immediately seeing any statements about the overall effect upon
> > real-world workloads. For a good example, does this patchset
> > measurably improve throughput or energy consumption on your servers?
>
> Hi Andrew,
>
> I re-tested with the following physical machines:
>
> Architecture: x86_64
> CPU(s): 96
> On-line CPU(s) list: 0-95
> Model name: Intel(R) Xeon(R) Platinum 8260 CPU @ 2.40GHz
>
> I found that the reason for the hotspot I described in cover letter is
> wrong. The reason for the down_read_trylock() hotspot is not because of
> the failure to trylock, but simply because of the atomic operation
> (cmpxchg). And this will lead to a significant reduction in IPC (insn
> per cycle).

...

> Then we can use the following perf command to view hotspots:
>
> perf top -U -F 999
>
> 1) Before applying this patchset:
>
> 32.31% [kernel] [k] down_read_trylock
> 19.40% [kernel] [k] pv_native_safe_halt
> 16.24% [kernel] [k] up_read
> 15.70% [kernel] [k] shrink_slab
> 4.69% [kernel] [k] _find_next_bit
> 2.62% [kernel] [k] shrink_node
> 1.78% [kernel] [k] shrink_lruvec
> 0.76% [kernel] [k] do_shrink_slab
>
> 2) After applying this patchset:
>
> 27.83% [kernel] [k] _find_next_bit
> 16.97% [kernel] [k] shrink_slab
> 15.82% [kernel] [k] pv_native_safe_halt
> 9.58% [kernel] [k] shrink_node
> 8.31% [kernel] [k] shrink_lruvec
> 5.64% [kernel] [k] do_shrink_slab
> 3.88% [kernel] [k] mem_cgroup_iter
>
> 2. At the same time, we use the following perf command to capture IPC
> information:
>
> perf stat -e cycles,instructions -G test -a --repeat 5 -- sleep 10
>
> 1) Before applying this patchset:
>
> Performance counter stats for 'system wide' (5 runs):
>
> 454187219766 cycles test (
> +- 1.84% )
> 78896433101 instructions test # 0.17 insn per
> cycle ( +- 0.44% )
>
> 10.0020430 +- 0.0000366 seconds time elapsed ( +- 0.00% )
>
> 2) After applying this patchset:
>
> Performance counter stats for 'system wide' (5 runs):
>
> 841954709443 cycles test (
> +- 15.80% ) (98.69%)
> 527258677936 instructions test # 0.63 insn per
> cycle ( +- 15.11% ) (98.68%)
>
> 10.01064 +- 0.00831 seconds time elapsed ( +- 0.08% )
>
> We can see that IPC drops very seriously when calling
> down_read_trylock() at high frequency. After using SRCU,
> the IPC is at a normal level.

The results you present do show improvement in IPC for an artificial test
script. But more interesting would be to see how a real world workloads
benefit from your changes.

> Thanks,
> Qi

--
Sincerely yours,
Mike.