Re: [PATCH crypto-stable] crypto: arch/lib - limit simd usage to PAGE_SIZE chunks

From: Eric Biggers
Date: Wed Apr 22 2020 - 00:04:21 EST


On Mon, Apr 20, 2020 at 01:57:11AM -0600, Jason A. Donenfeld wrote:
> The initial Zinc patchset, after some mailing list discussion, contained
> code to ensure that kernel_fpu_enable would not be kept on for more than
> a PAGE_SIZE chunk, since it disables preemption. The choice of PAGE_SIZE
> isn't totally scientific, but it's not a bad guess either, and it's
> what's used in both the x86 poly1305 and blake2s library code already.
> Unfortunately it appears to have been left out of the final patchset
> that actually added the glue code. So, this commit adds back the
> PAGE_SIZE chunking.
>
> Fixes: 84e03fa39fbe ("crypto: x86/chacha - expose SIMD ChaCha routine as library function")
> Fixes: b3aad5bad26a ("crypto: arm64/chacha - expose arm64 ChaCha routine as library function")
> Fixes: a44a3430d71b ("crypto: arm/chacha - expose ARM ChaCha routine as library function")
> Fixes: f569ca164751 ("crypto: arm64/poly1305 - incorporate OpenSSL/CRYPTOGAMS NEON implementation")
> Fixes: a6b803b3ddc7 ("crypto: arm/poly1305 - incorporate OpenSSL/CRYPTOGAMS NEON implementation")
> Cc: Eric Biggers <ebiggers@xxxxxxxxxx>
> Cc: Ard Biesheuvel <ardb@xxxxxxxxxx>
> Cc: stable@xxxxxxxxxxxxxxx
> Signed-off-by: Jason A. Donenfeld <Jason@xxxxxxxxx>
> ---
> Eric, Ard - I'm wondering if this was in fact just an oversight in Ard's
> patches, or if there was actually some later discussion in which we
> concluded that the PAGE_SIZE chunking wasn't required, perhaps because
> of FPU changes. If that's the case, please do let me know, in which case
> I'll submit a _different_ patch that removes the chunking from x86 poly
> and blake. I can't find any emails that would indicate that, but I might
> be mistaken.
>
> arch/arm/crypto/chacha-glue.c | 16 +++++++++++++---
> arch/arm/crypto/poly1305-glue.c | 17 +++++++++++++----
> arch/arm64/crypto/chacha-neon-glue.c | 16 +++++++++++++---
> arch/arm64/crypto/poly1305-glue.c | 17 +++++++++++++----
> arch/x86/crypto/chacha_glue.c | 16 +++++++++++++---
> 5 files changed, 65 insertions(+), 17 deletions(-)

I don't think you're missing anything. On x86, kernel_fpu_begin() and
kernel_fpu_end() did get optimized in v5.2. But they still disable preemption,
which is the concern here.

>
> diff --git a/arch/arm/crypto/chacha-glue.c b/arch/arm/crypto/chacha-glue.c
> index 6fdb0ac62b3d..0e29ebac95fd 100644
> --- a/arch/arm/crypto/chacha-glue.c
> +++ b/arch/arm/crypto/chacha-glue.c
> @@ -91,9 +91,19 @@ void chacha_crypt_arch(u32 *state, u8 *dst, const u8 *src, unsigned int bytes,
> return;
> }
>
> - kernel_neon_begin();
> - chacha_doneon(state, dst, src, bytes, nrounds);
> - kernel_neon_end();
> + for (;;) {
> + unsigned int todo = min_t(unsigned int, PAGE_SIZE, bytes);
> +
> + kernel_neon_begin();
> + chacha_doneon(state, dst, src, todo, nrounds);
> + kernel_neon_end();
> +
> + bytes -= todo;
> + if (!bytes)
> + break;
> + src += todo;
> + dst += todo;
> + }
> }
> EXPORT_SYMBOL(chacha_crypt_arch);

Seems this should just be a 'while' loop?

while (bytes) {
unsigned int todo = min_t(unsigned int, PAGE_SIZE, bytes);

kernel_neon_begin();
chacha_doneon(state, dst, src, todo, nrounds);
kernel_neon_end();

bytes -= todo;
src += todo;
dst += todo;
}

Likewise elsewhere in this patch. (For Poly1305, len >= POLY1305_BLOCK_SIZE at
the beginning, so that could use a 'do' loop.)

- Eric