RE: [PATCH v2 00/19] crypto: x86 - fix RCU stalls
From: Elliott, Robert (Servers)
Date: Tue Nov 01 2022 - 17:35:15 EST
> -----Original Message-----
> From: Elliott, Robert (Servers) <elliott@xxxxxxx>
> Sent: Wednesday, October 12, 2022 4:59 PM
> To: herbert@xxxxxxxxxxxxxxxxxxx; davem@xxxxxxxxxxxxx;
> tim.c.chen@xxxxxxxxxxxxxxx; ap420073@xxxxxxxxx; ardb@xxxxxxxxxx;
> linux-crypto@xxxxxxxxxxxxxxx; linux-kernel@xxxxxxxxxxxxxxx
> Cc: Elliott, Robert (Servers) <elliott@xxxxxxx>
> Subject: [PATCH v2 00/19] crypto: x86 - fix RCU stalls
>
> This series fixes the RCU stalls triggered by the x86 crypto
> modules discussed in
> https://lore.kernel.org/all/MW5PR84MB18426EBBA3303770A8BC0BDFAB759@MW5PR84MB1842.NAMPRD84.PROD.OUTLOOK.COM/
I've instrumented all the x86 crypto modules, including ways to
experiment with different loop sizes. Here are some results with
the hash functions.
Key:
  calls     = number of kernel_fpu_begin()/end() calls made by the module
  cost      = number of CPU cycles consumed by those calls (overhead)
  maxcycles = number of CPU cycles between those calls in FPU context
  bpf       = bytes_per_fpu loop size
  KiB       = bpf expressed in KiB
  maxlen    = maximum number of bytes per loop via update()
  maxlen2   = maximum number of bytes per loop via finup()
This is on a 2.2 GHz Cascade Lake CPU, where each cycle is nominally
0.45 ns. The CPU does not support SHA-NI instructions, so those
results are missing.
Here are the results from a boot with the avx2 bytes_per_fpu values set
to 0 (unlimited - original behavior).
Booting includes:
- processing 2.3 GB of SHA-512 kernel module hashes
- crypto self-tests
- crypto extra self-tests (CONFIG_CRYPTO_MANAGER_EXTRA_TESTS=y)
   calls        cost    maxcycles      bpf  KiB   maxlen  maxlen2 algorithm                module
======== =========== ============ ======== ==== ======== ======== ======================== ============================
    3641      177182        10230        0    0     4096        0 __ghash-pclmulqdqni      ghash_clmulni_intel
    2242      150516         1684        0    0     8112        0 crc32-pclmul             crc32_pclmul
    1008       43800        22404        0    0     8068     8105 crc32c-intel             crc32c_intel
    2565      179734         4286        0    0     7791     8027 crct10dif-pclmul         crct10dif_pclmul
    1603       77112         2414        0    0     8132        0 nhpoly1305-avx2          nhpoly1305_avx2
    1671       81108         9390   203776  199     8109        0 nhpoly1305-sse2          nhpoly1305_sse2
    1977      103598         5314        0    0     8112        0 poly1305-simd            poly1305_x86_64
   26744     1251756         2046        0    0     8096        0 polyval-clmulni          polyval_clmulni
   14669      682428        65462    30720   30      251     8096 sha1-avx                 sha1_ssse3
   14669      682428        65462        0    0     7170        0 sha1-avx2                sha1_ssse3
   14669      682428        65462    34816   34        0        0 sha1-shani               sha1_ssse3
   14669      682428        65462    26624   26     8089     8164 sha1-ssse3               sha1_ssse3
   26768     1230100       144902    11264   11     8130     8159 sha224-avx               sha256_ssse3
   26768     1230100       144902    13312   13     8078     8146 sha224-avx2              sha256_ssse3
   26768     1230100       144902    13312   13        0        0 sha224-shani             sha256_ssse3
   26768     1230100       144902    11264   11     8068     8168 sha224-ssse3             sha256_ssse3
   26768     1230100       144902    11264   11     8130     8159 sha256-avx               sha256_ssse3
   26768     1230100       144902    13312   13     8078     8146 sha256-avx2              sha256_ssse3
   26768     1230100       144902    13312   13        0        0 sha256-shani             sha256_ssse3
   26768     1230100       144902    11264   11     8068     8168 sha256-ssse3             sha256_ssse3
   29157     2044882    164510724    17408   17        0     8127 sha384-avx               sha512_ssse3
   29157     2044882    164510724        0    0        0 48175432 sha384-avx2              sha512_ssse3
   29157     2044882    164510724    17408   17        0     8055 sha384-ssse3             sha512_ssse3
   29157     2044882    164510724    17408   17        0     8127 sha512-avx               sha512_ssse3
   29157     2044882    164510724        0    0        0 48175432 sha512-avx2              sha512_ssse3
   29157     2044882    164510724    17408   17        0     8055 sha512-ssse3             sha512_ssse3
    4314      193456       124918        0    0     7672     8101 sm3-avx                  sm3_avx_x86_64
The self-tests only test small data sets (even the extra tests
limit themselves to PAGE_SIZE * 2) so only the sha512_ssse3
module was stressed with large requests.
The cost of the kernel_fpu_begin()/end() calls (2044882 cycles) was
929 us, and the longest time in FPU context (164510724 cycles) was
75 ms.
I think the biggest file it encounters is:
-rw-r--r--. 1 root root 48186713 Nov 1 13:14 /lib/modules/6.0.0+/kernel/fs/xfs/xfs.ko
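The cycle-to-time figures quoted above can be reproduced with a short
script. This is only a sketch: it assumes the CPU runs at exactly its
nominal 2.2 GHz for the whole measurement, and the helper name
cycles_to_seconds() is made up for illustration.

```python
CPU_HZ = 2.2e9  # nominal clock of the Cascade Lake CPU (assumed constant)


def cycles_to_seconds(cycles, hz=CPU_HZ):
    """Convert a cycle count to seconds at a fixed nominal frequency."""
    return cycles / hz


# Values taken from the sha512_ssse3 rows of the boot table.
cost_us = cycles_to_seconds(2044882) * 1e6       # begin/end overhead
max_fpu_ms = cycles_to_seconds(164510724) * 1e3  # longest FPU section

print(f"kernel_fpu_begin()/end() overhead: {cost_us:.0f} us")
print(f"longest time in FPU context:       {max_fpu_ms:.0f} ms")
```

Rounded to whole units, this gives the 929 us and 75 ms mentioned
above. If the clock was lower than nominal at any point, the real
times would be longer.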
I added tcrypt tests to exercise each driver ten times with 1 MiB data,
and that exposes all the drivers to larger requests.
bigbuf tests with no limits:
   calls        cost    maxcycles      bpf  KiB   maxlen  maxlen2 algorithm                module
======== =========== ============ ======== ==== ======== ======== ======================== ============================
    1000      156354      1484434        0    0  1048576        0 __ghash-pclmulqdqni      ghash_clmulni_intel
    1000      150386       221710        0    0  1048576        0 crc32-pclmul             crc32_pclmul
    1000      104890       114000        0    0  1048576        0 crc32c-intel             crc32c_intel
    1000      169596       182904        0    0  1048576        0 crct10dif-pclmul         crct10dif_pclmul
    1000      122842       267568        0    0  1048576        0 nhpoly1305-avx2          nhpoly1305_avx2
    1000      190530       453118        0    0  1048576        0 nhpoly1305-sse2          nhpoly1305_sse2
    1000      134682       431264        0    0  1048576        0 poly1305-simd            poly1305_x86_64
    8000      387206       215922        0    0  1048576        0 polyval-clmulni          polyval_clmulni
    6000      562932      2831190        0    0  1048576        0 sha1-avx                 sha1_ssse3
    6000      562932      2831190        0    0  1048576        0 sha1-avx2                sha1_ssse3
    6000      562932      2831190    34816   34        0        0 sha1-shani               sha1_ssse3
    6000      562932      2831190        0    0  1048576        0 sha1-ssse3               sha1_ssse3
   12000     1212742      6558712        0    0  1048576        0 sha224-avx               sha256_ssse3
   12000     1212742      6558712        0    0  1048576        0 sha224-avx2              sha256_ssse3
   12000     1212742      6558712    13312   13        0        0 sha224-shani             sha256_ssse3
   12000     1212742      6558712        0    0  1048576        0 sha224-ssse3             sha256_ssse3
   12000     1212742      6558712        0    0  1048576        0 sha256-avx               sha256_ssse3
   12000     1212742      6558712        0    0  1048576        0 sha256-avx2              sha256_ssse3
   12000     1212742      6558712    13312   13        0        0 sha256-shani             sha256_ssse3
   12000     1212742      6558712        0    0  1048576        0 sha256-ssse3             sha256_ssse3
   12006     1250296      4621038        0    0  1048576        0 sha384-avx               sha512_ssse3
   12006     1250296      4621038        0    0  1048576  1037416 sha384-avx2              sha512_ssse3
   12006     1250296      4621038        0    0  1048576        0 sha384-ssse3             sha512_ssse3
   12006     1250296      4621038        0    0  1048576        0 sha512-avx               sha512_ssse3
   12006     1250296      4621038        0    0  1048576  1037416 sha512-avx2              sha512_ssse3
   12006     1250296      4621038        0    0  1048576        0 sha512-ssse3             sha512_ssse3
    2000      221468      6236756        0    0  1048576        0 sm3-avx                  sm3_avx_x86_64
Setting bpf limits based on those results narrows the maxcycles in
FPU context. I've seen results vary from 81912 cycles (37 us) up to
about 102 us - not very tight, but much better than ranging up
to 75 ms.
bigbuf tests with bytes_per_fpu limits as shown:
   calls        cost    maxcycles      bpf  KiB   maxlen  maxlen2 algorithm                module
======== =========== ============ ======== ==== ======== ======== ======================== ============================
   21000     1002372       138558    51200   50    51200        0 __ghash-pclmulqdqni      ghash_clmulni_intel
    2000      220666       226806   646912  631   646912        0 crc32-pclmul             crc32_pclmul
    2000      255110       105968   895232  874   895232        0 crc32c-intel             crc32c_intel
    2000      218942       107930   626944  612   626944        0 crct10dif-pclmul         crct10dif_pclmul
    4000      208170       141356   345088  337   345088        0 nhpoly1305-avx2          nhpoly1305_avx2
    6000      285286       105072   203520  198   203520        0 nhpoly1305-sse2          nhpoly1305_sse2
    5000      368866       162262   222976  217   222976        0 poly1305-simd            poly1305_x86_64
   10000      457010       142362   402688  393   402688        0 polyval-clmulni          polyval_clmulni
  108000     6048076       160670    30720   30    30720        0 sha1-avx                 sha1_ssse3
  108000     6048076       160670    34816   34    34816        0 sha1-avx2                sha1_ssse3
  108000     6048076       160670    27392   26    27392        0 sha1-ssse3               sha1_ssse3
  520000    23646576       196462    11520   11    11520        0 sha224-avx               sha256_ssse3
  520000    23646576       196462    14080   13    14080        0 sha224-avx2              sha256_ssse3
  520000    23646576       196462    11776   11    11776        0 sha224-ssse3             sha256_ssse3
  520000    23646576       196462    11520   11    11520        0 sha256-avx               sha256_ssse3
  520000    23646576       196462    14080   13    14080        0 sha256-avx2              sha256_ssse3
  520000    23646576       196462    11776   11    11776        0 sha256-ssse3             sha256_ssse3
  356156    18242860       226538    17152   16    17152        0 sha384-avx               sha512_ssse3
  356156    18242860       226538    20480   20    20480    20480 sha384-avx2              sha512_ssse3
  356156    18242860       226538    17408   17    17408        0 sha384-ssse3             sha512_ssse3
  356156    18242860       226538    17152   16    17152        0 sha512-avx               sha512_ssse3
  356156    18242860       226538    20480   20    20480    20480 sha512-avx2              sha512_ssse3
  356156    18242860       226538    17408   17    17408        0 sha512-ssse3             sha512_ssse3
   93000     4537164       138924    11520   11    11520        0 sm3-avx                  sm3_avx_x86_64
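For the modules that register a single algorithm, the call counts with
limits are consistent with each 1 MiB request being split into
ceil(length / bpf) FPU sections. The sketch below checks that against
the numbers in the two bigbuf tables. The splitting rule is an
assumption about how the loop behaves, not a copy of the patch code,
and fpu_sections() is a hypothetical helper name.

```python
REQUEST_BYTES = 1 << 20  # one 1 MiB bigbuf request


def fpu_sections(length, bytes_per_fpu):
    """FPU sections needed for one request; bytes_per_fpu == 0 means no limit."""
    if bytes_per_fpu == 0:
        return 1
    return -(-length // bytes_per_fpu)  # integer ceiling division


# algorithm: (bpf, calls with no limit, calls with limit), from the tables
observed = {
    "__ghash-pclmulqdqni": (51200, 1000, 21000),
    "crc32-pclmul": (646912, 1000, 2000),
    "crc32c-intel": (895232, 1000, 2000),
    "crct10dif-pclmul": (626944, 1000, 2000),
    "nhpoly1305-avx2": (345088, 1000, 4000),
    "nhpoly1305-sse2": (203520, 1000, 6000),
    "poly1305-simd": (222976, 1000, 5000),
}

for name, (bpf, unlimited, limited) in observed.items():
    predicted = unlimited * fpu_sections(REQUEST_BYTES, bpf)
    print(f"{name:20} predicted={predicted:6} measured={limited:6}")
```

polyval and the sha*/sm3 modules are left out: polyval already makes
several calls per request without a limit, and the sha modules report
the same counters for every algorithm in the module, so this simple
per-algorithm rule cannot be checked against them directly.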
If I reboot with sha512-avx2 set to 20 KiB, the sha512-avx2
maxcycles can still be large (e.g., about 2 ms). That's much
better than the original 75 ms, but still not in the 50 us range.
I set /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor to
"performance" in .bash_profile, but that's not effective during
boot, so maybe that is the source of variability.
Example boot with 20 KiB limit:
   calls        cost    maxcycles      bpf  KiB   maxlen  maxlen2 algorithm                module
======== =========== ============ ======== ==== ======== ======== ======================== ============================
  161011    16232280      4049644    20480   20        0    20480 sha512-avx2              sha512_ssse3
Limiting it to 1 KiB does reduce maxcycles to the us range, but
the cost of all the extra calls soars.
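A rough way to see that tradeoff is to take the average cost per
kernel_fpu_begin()/end() pair from the 20 KiB boot and scale it by the
number of sections each limit would need. This is a back-of-the-envelope
sketch only: it assumes the per-call cost stays constant regardless of
chunk size, treats the "2.3 GB" of module data as 2.3e9 bytes, and
ignores the self-tests. The function names are invented for the example.

```python
CPU_HZ = 2.2e9  # nominal clock (assumed constant)


def avg_cycles_per_call(cost_cycles, calls):
    """Average overhead of one kernel_fpu_begin()/end() pair, in cycles."""
    return cost_cycles / calls


def estimated_overhead_ms(total_bytes, bytes_per_fpu, cycles_per_call, hz=CPU_HZ):
    """Estimated begin/end overhead for hashing total_bytes in bpf-sized sections."""
    calls = -(-total_bytes // bytes_per_fpu)  # integer ceiling division
    return calls * cycles_per_call / hz * 1e3


# cost and calls from the sha512-avx2 row of the 20 KiB boot
per_call = avg_cycles_per_call(16232280, 161011)
total_bytes = 2_300_000_000  # assumption: "2.3 GB" read as decimal gigabytes

for bpf in (20480, 1024):
    est = estimated_overhead_ms(total_bytes, bpf, per_call)
    print(f"bpf={bpf:6}: about {est:.0f} ms of begin/end overhead")
```

Under these assumptions the overhead grows by roughly the ratio of the
two limits (about 20x) when going from 20 KiB to 1 KiB.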
So, for v3 of the series, I plan to propose values ranging from:
- 11 to 20 KiB for sha* and sm3
- 200 to 400 KiB for *poly*
- 600 to 800 KiB for crc*
v3 will only cover the hash functions - skcipher and aead
have some unique challenges that we can tackle later.