Re: [PATCH] crypto: ahash - Stop legacy tfms from using the set_virt fallback path

From: Eric Biggers
Date: Sun Jun 15 2025 - 14:47:33 EST


On Sun, Jun 15, 2025 at 09:22:51AM +0200, Ard Biesheuvel wrote:
> On Sun, 15 Jun 2025 at 05:18, Eric Biggers <ebiggers@xxxxxxxxxx> wrote:
> >
> ...
> > After disabling the crypto self-tests, I was then able to run a benchmark of
> > SHA-256 hashing 4096-byte messages, which fortunately didn't encounter the
> > recursion bug. I got the following results:
> >
> > ARMv8 crypto extensions: 1864 MB/s
> > Generic C code: 358 MB/s
> > Qualcomm Crypto Engine: 55 MB/s
> >
> > So just to clarify, you believe that asynchronous hash drivers like the Qualcomm
> > Crypto Engine one are useful, and the changes that you're requiring to the
> > CPU-based code are to support these drivers?
> >
>
> And this offload engine only has one internal queue, right? Whereas
> the CPU results may be multiplied by the number of cores on the soc.
> It would still be interesting how much of this is due to latency
> rather than limited throughput but it seems highly unlikely that there
> are any message sizes large enough where QCE would catch up with the
> CPUs. (AIUI, the only use case we have in the kernel today for message
> sizes that are substantially larger than this is kTLS, but I'm not
> sure how well it works with crypto_aead compared to offload at a more
> suitable level in the networking stack, and this driver does not
> implement GCM in the first place)
>
> On ARM socs, these offload engines usually exist primarily for the
> benefit of the verified boot implementation in mask ROM, which
> obviously needs to be minimal but doesn't have to be very fast in
> order to get past the first boot stages and hand over to software.
> Then, since the IP block is there, it's listed as a feature in the
> data sheet, even though it is not very useful when running under the
> OS.

With 1 MiB messages, I get 1913 MB/s with ARMv8 CE and 142 MB/s with QCE.
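
On the latency vs. throughput question, here is a rough back-of-the-envelope
sketch using only the single-request numbers above. It assumes (my assumption,
not something I measured) that the time per request is a fixed overhead plus
size / bandwidth, that MB means 10^6 bytes, and that two data points are enough
to fit that line. The function name is just for illustration.

```python
# Assumed model: time(size) = latency + size / bandwidth.
# Assumed units: MB = 10**6 bytes; 4 KiB = 4096 bytes; 1 MiB = 2**20 bytes.
MB = 10**6


def fit_latency_bandwidth(size_small, rate_small, size_large, rate_large):
    """Fit the two-parameter model to two (message size, throughput) points.

    Sizes are in bytes, rates in bytes per second.
    Returns (latency in seconds, bandwidth in bytes per second).
    """
    t_small = size_small / rate_small  # time for one small message
    t_large = size_large / rate_large  # time for one large message
    bandwidth = (size_large - size_small) / (t_large - t_small)
    latency = t_small - size_small / bandwidth
    return latency, bandwidth


# Single-request results quoted in this thread.
qce_latency, qce_bandwidth = fit_latency_bandwidth(4096, 55 * MB, 1 << 20, 142 * MB)
ce_latency, ce_bandwidth = fit_latency_bandwidth(4096, 1864 * MB, 1 << 20, 1913 * MB)

print(f"QCE:      {qce_latency * 1e6:.1f} us/request, {qce_bandwidth / MB:.0f} MB/s")
print(f"ARMv8 CE: {ce_latency * 1e6:.2f} us/request, {ce_bandwidth / MB:.0f} MB/s")
```

Under that model, QCE comes out at about 46 us of fixed overhead per request
and an asymptotic rate of about 143 MB/s. So if the model is anywhere near
right, even with zero latency QCE would stay well below the 358 MB/s of the
generic C code.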

(BTW, that's single-buffer ARMv8 CE. My two-buffer code is over 3000 MB/s.)

I then changed my benchmark code to take full advantage of the async API and
submit as many requests as the hardware can handle. (This would be a best-case
scenario for QCE; in many real use cases this is not possible.) The result with
QCE was 58 MB/s with 4 KiB messages and 155 MB/s with 1 MiB messages.

So yes, QCE seems to have only one queue, and even that one queue is *much*
slower than just using the CPU. It's even slower than the generic C code.

And until I fixed it recently, the Crypto API defaulted to using QCE instead of
ARMv8 CE.

But this seems to be a common pattern among the offload engines. I noticed a
similar issue with Intel QAT, which I elaborate on in this patch:
https://lore.kernel.org/r/20250615045145.224567-1-ebiggers@xxxxxxxxxx

- Eric