Re: [PATCH v3 0/8] make slab shrink lockless

From: Qi Zheng
Date: Mon Feb 27 2023 - 08:33:12 EST




On 2023/2/27 03:51, Andrew Morton wrote:
On Sun, 26 Feb 2023 22:46:47 +0800 Qi Zheng <zhengqi.arch@xxxxxxxxxxxxx> wrote:

Hi all,

This patch series aims to make slab shrink lockless.

What an awesome changelog.

2. Survey
=========

Especially this part.

Looking through all the prior efforts and at this patchset I am not
immediately seeing any statements about the overall effect upon
real-world workloads. For a good example, does this patchset
measurably improve throughput or energy consumption on your servers?

Hi Andrew,

I re-tested with the following physical machines:

Architecture: x86_64
CPU(s): 96
On-line CPU(s) list: 0-95
Model name: Intel(R) Xeon(R) Platinum 8260 CPU @ 2.40GHz

I found that the reason for the hotspot I described in cover letter is
wrong. The reason for the down_read_trylock() hotspot is not because of
the failure to trylock, but simply because of the atomic operation
(cmpxchg). And this will lead to a significant reduction in IPC (insn
per cycle).

To verify this, I did the following tests:

1. Run the following script to create down_read_trylock() hotspots:

```
#!/bin/bash

DIR="/root/shrinker/memcg/mnt"

do_create()
{
mkdir -p /sys/fs/cgroup/memory/test
mkdir -p /sys/fs/cgroup/perf_event/test
echo 4G > /sys/fs/cgroup/memory/test/memory.limit_in_bytes
for i in `seq 0 $1`;
do
mkdir -p /sys/fs/cgroup/memory/test/$i;
echo $$ > /sys/fs/cgroup/memory/test/$i/cgroup.procs;
echo $$ > /sys/fs/cgroup/perf_event/test/cgroup.procs;
mkdir -p $DIR/$i;
done
}

do_mount()
{
for i in `seq $1 $2`;
do
mount -t tmpfs $i $DIR/$i;
done
}

do_touch()
{
for i in `seq $1 $2`;
do
echo $$ > /sys/fs/cgroup/memory/test/$i/cgroup.procs;
echo $$ > /sys/fs/cgroup/perf_event/test/cgroup.procs;
dd if=/dev/zero of=$DIR/$i/file$i bs=1M count=1 &
done
}

case "$1" in
touch)
do_touch $2 $3
;;
test)
do_create 4000
do_mount 0 4000
do_touch 0 3000
;;
*)
exit 1
;;
esac
```

Save the above script, then run test and touch commands.

Then we can use the following perf command to view hotspots:

perf top -U -F 999

1) Before applying this patchset:

32.31% [kernel] [k] down_read_trylock
19.40% [kernel] [k] pv_native_safe_halt
16.24% [kernel] [k] up_read
15.70% [kernel] [k] shrink_slab
4.69% [kernel] [k] _find_next_bit
2.62% [kernel] [k] shrink_node
1.78% [kernel] [k] shrink_lruvec
0.76% [kernel] [k] do_shrink_slab

2) After applying this patchset:

27.83% [kernel] [k] _find_next_bit
16.97% [kernel] [k] shrink_slab
15.82% [kernel] [k] pv_native_safe_halt
9.58% [kernel] [k] shrink_node
8.31% [kernel] [k] shrink_lruvec
5.64% [kernel] [k] do_shrink_slab
3.88% [kernel] [k] mem_cgroup_iter

2. At the same time, we use the following perf command to capture IPC
information:

perf stat -e cycles,instructions -G test -a --repeat 5 -- sleep 10

1) Before applying this patchset:

Performance counter stats for 'system wide' (5 runs):

454187219766 cycles test ( +- 1.84% )
78896433101 instructions test # 0.17 insn per cycle ( +- 0.44% )

10.0020430 +- 0.0000366 seconds time elapsed ( +- 0.00% )

2) After applying this patchset:

Performance counter stats for 'system wide' (5 runs):

841954709443 cycles test ( +- 15.80% ) (98.69%)
527258677936 instructions test # 0.63 insn per cycle ( +- 15.11% ) (98.68%)

10.01064 +- 0.00831 seconds time elapsed ( +- 0.08% )

We can see that IPC drops very seriously when calling
down_read_trylock() at high frequency. After using SRCU,
the IPC is at a normal level.

Thanks,
Qi