Re: [PATCH 07/10] crypto: Use ARCH_DMA_MINALIGN instead of ARCH_KMALLOC_MINALIGN

From: Catalin Marinas
Date: Thu Apr 14 2022 - 15:49:52 EST


On Wed, Apr 13, 2022 at 09:53:24AM -1000, Linus Torvalds wrote:
> On Tue, Apr 12, 2022 at 10:47 PM Catalin Marinas
> <catalin.marinas@xxxxxxx> wrote:
> > I agree. There is also an implicit expectation that the DMA API works on
> > kmalloc'ed buffers and that's what ARCH_DMA_MINALIGN is for (and the
> > dynamic arch_kmalloc_minalign() in this series). But the key point is
> > that the driver doesn't need to know the CPU cache topology, coherency,
> > the DMA API and kmalloc() take care of these.
>
> Honestly, I think it would probably be worth discussing the "kmalloc
> DMA alignment" issues.
>
> 99.9% of kmalloc users don't want to do DMA.
>
> And there's actually a fair amount of small kmalloc for random stuff.
> Right now on my laptop, I have
>
> kmalloc-8 16907 18432 8 512 1 : ...
>
> according to slabinfo, so almost 17 _thousand_ allocations of 8 bytes.
>
> It's all kinds of sad if those allocations need to be 64 bytes in size
> just because of some silly DMA alignment issue, when none of them want
> it.

It's a lot worse: ARCH_KMALLOC_MINALIGN is currently 128 bytes on arm64.
I want to at least get it down to 64 with this series while preserving
the current kmalloc() semantics.
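
To put rough numbers on that, here is a minimal userspace sketch (plain C,
not kernel code) of the padding implied by a minimum kmalloc alignment.
The helper names are made up for illustration, the object count and size
come from the kmalloc-8 slabinfo line quoted above, and the sketch assumes
each small object is simply rounded up to the minimum alignment.

```c
#include <assert.h>
#include <stddef.h>

/* Round size up to a power-of-two alignment. This models what a minimum
 * kmalloc alignment does to a small object (an assumption of this sketch). */
size_t round_up_pow2(size_t size, size_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    return (size + align - 1) & ~(align - 1);
}

/* Total padding, in bytes, for `count` objects of `size` bytes each when
 * every object is rounded up to `minalign`. */
size_t padding_bytes(size_t count, size_t size, size_t minalign)
{
    return count * (round_up_pow2(size, minalign) - size);
}
```

With count = 16907 and size = 8, padding_bytes() gives 2028840 bytes at a
128-byte minimum alignment, 946792 bytes at 64 and 0 bytes at 8.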

If we know the SoC is fully coherent (a bit tricky with late probed
devices), we could get the alignment down to 8. In the mobile space,
unfortunately, most DMA is non-coherent.

I think it's worth investigating the __dma annotations that Greg
suggested, though I suspect that either it is too difficult to track or
we just end up with this annotation everywhere. There are cases where
the memory is allocated outside the driver that knows the DMA needs,
though I guess these are either full page allocations or
kmem_cache_alloc() (e.g. page cache pages, skb).

There's also Ard's suggestion to bounce the (inbound DMA) buffer if not
aligned. That's doable but dma_map_single(), for example, only gets the
size of some random structure/buffer. If the size is below
ARCH_DMA_MINALIGN (or cache_line_size()), the DMA API implementation
would have to retrieve the slab cache, check the real allocation size
and then bounce if necessary.
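
The check described above could look roughly like the following userspace
sketch. It is not the DMA API implementation: needs_bounce() and
SKETCH_DMA_MINALIGN are hypothetical names, the 128 value is an assumption
standing in for ARCH_DMA_MINALIGN, and the slab cache lookup is replaced by
the caller passing in the real allocation size. It also assumes that an
allocation whose real size is a multiple of the DMA alignment does not
share cache lines with its neighbours.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Assumed value for illustration; stands in for ARCH_DMA_MINALIGN. */
#define SKETCH_DMA_MINALIGN 128u

/* map_size:   the size handed to the mapping call
 * alloc_size: the real size of the underlying allocation, which in the
 *             kernel would have to be looked up from the slab cache */
bool needs_bounce(size_t map_size, size_t alloc_size)
{
    /* The real allocation can never be smaller than what is mapped. */
    assert(alloc_size >= map_size);

    /* Mappings at or above the DMA alignment skip the lookup entirely. */
    if (map_size >= SKETCH_DMA_MINALIGN)
        return false;

    /* Small mapping: bounce unless the allocation itself is padded out
     * to a whole number of DMA-aligned units. */
    return (alloc_size % SKETCH_DMA_MINALIGN) != 0;
}
```

For example, needs_bounce(8, 8) is true, while needs_bounce(8, 128) is
false because the small buffer already owns a full aligned unit.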

Irrespective of which option we go for, I think at least part of this
series decoupling ARCH_KMALLOC_MINALIGN from ARCH_DMA_MINALIGN is still
needed since currently the minalign is used in some compile time
attributes. Even getting the kmalloc() size down to 64 is a significant
improvement over 128.
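
As an illustration of why the compile-time uses matter, here is a small
sketch of a structure whose trailing context carries an alignment
attribute. The structure, the macro name and the value 128 are
hypothetical; the sketch relies on the GCC/Clang aligned attribute and on
C11 static_assert. The attribute needs a constant, so it cannot follow a
minimum kmalloc alignment that is reduced or decided at run time and has
to refer to the DMA constant instead.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed value for illustration; the real constant is per-architecture. */
#define SKETCH_DMA_MINALIGN 128

/* Hypothetical request structure with a trailing context area that a
 * device might DMA into. */
struct sketch_req {
    unsigned int flags;
    /* The aligned attribute requires a compile-time constant. */
    unsigned char ctx[] __attribute__((aligned(SKETCH_DMA_MINALIGN)));
};

/* Checked at compile time: ctx starts on a DMA-aligned boundary. */
static_assert(offsetof(struct sketch_req, ctx) % SKETCH_DMA_MINALIGN == 0,
              "ctx must start on a DMA-aligned boundary");

/* Offset of the context area from the start of the structure. */
size_t sketch_ctx_offset(void)
{
    return offsetof(struct sketch_req, ctx);
}
```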

Subsequently I'd attempt Ard's bouncing idea as a quick workaround and
assess the bouncing overhead on some real platforms. This may be needed
before we track down all places to use dma_kmalloc().

I need to think some more about Greg's __dma annotation; as I said, the
allocation may be decoupled from the driver in some cases.

--
Catalin