[RFC v1 0/4] Allow dynamic allocation of software IO TLB bounce buffers

From: Petr Tesarik
Date: Mon Mar 20 2023 - 08:29:16 EST


From: Petr Tesarik <petr.tesarik.ext@xxxxxxxxxx>

The goal of my work is to provide more flexibility in the sizing of
SWIOTLB. This patch series is a request for comments from the wider
community. The code is more of a crude hack than a final solution.

I would appreciate suggestions for measuring the performance impact
of changes in SWIOTLB. More info at the end of this cover letter.

The software IO TLB was designed with these assumptions:

1. It would not be used much, especially on 64-bit systems.
2. A small fixed memory area (64 MiB by default) is sufficient to
   handle the few cases which require a bounce buffer.
3. 64 MiB is little enough that it has no impact on the rest of the
   system.

These assumptions do not always hold, for at least three reasons.

First, if SEV is active, all DMA must be done through shared
unencrypted pages, and SWIOTLB is used to make this happen without
changing device drivers. The software IO TLB size is increased to
6% of total memory in sev_setup_arch(), but that is more of an
approximation. The actual requirements may vary depending on the
amount of I/O and which drivers are used. These factors may not be
known at boot time, i.e. when SWIOTLB is allocated.
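
To give a sense of scale, the 6% rule can be sketched as below. This is
a minimal user-space illustration, not the code in sev_setup_arch(): the
helper name swiotlb_sev_estimate() is hypothetical, and any lower or
upper bound the kernel applies to the result is left out.

```c
#include <stdint.h>

/*
 * Hypothetical helper: 6% of total memory, in bytes.
 * Divides before multiplying so that very large sizes cannot overflow;
 * the result is therefore rounded down slightly.
 */
uint64_t swiotlb_sev_estimate(uint64_t total_mem_bytes)
{
	return total_mem_bytes / 100 * 6;
}
```

For a guest with 16 GiB of memory this gives about 983 MiB, compared
with the 64 MiB default.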

Second, on the Raspberry Pi 4, swiotlb is used by dma-buf for pages
moved from the rendering GPU (v3d driver), which can access all
memory, to the display output (vc4 driver), which is connected to a
bus with an address limit of 1 GiB and no IOMMU. These buffers can
be large (8 MiB with a FullHD monitor, 34 MiB with a 4K monitor)
and cannot even be handled by the current SWIOTLB, because they exceed
the maximum segment size of 256 KiB. Mapping failures can be
easily reproduced with GNOME remote desktop on a Raspberry Pi 4.
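
The 256 KiB limit follows from the slot size and the number of slots in
one contiguous segment. The sketch below assumes the values of
IO_TLB_SHIFT and IO_TLB_SEGSIZE from include/linux/swiotlb.h (2 KiB
slots, 128 slots per segment); fits_in_one_segment() is a hypothetical
helper for illustration only.

```c
#include <stdbool.h>
#include <stddef.h>

#define IO_TLB_SHIFT   11   /* assumed: one slot is 1 << 11 = 2 KiB */
#define IO_TLB_SEGSIZE 128  /* assumed: slots per contiguous segment */

/* Largest single mapping one segment can hold: 128 * 2 KiB = 256 KiB. */
#define MAX_SEGMENT_BYTES ((size_t)IO_TLB_SEGSIZE << IO_TLB_SHIFT)

/* Hypothetical helper: can a mapping of this size be bounced at all? */
bool fits_in_one_segment(size_t mapping_bytes)
{
	return mapping_bytes <= MAX_SEGMENT_BYTES;
}
```

Under these assumptions an 8 MiB buffer is 32 times larger than the
biggest mapping a segment can hold, so it fails regardless of how much
of the pool is free.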

Third, other colleagues have noticed that they can reliably get rid
of occasional OOM kills on an Arm embedded device by reducing the
SWIOTLB size. This can be achieved with a kernel parameter, but
determining the right value puts additional burden on pre-release
testing, which could be avoided if SWIOTLB is allocated small and
grows only when necessary.
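
The kernel parameter in question is swiotlb=, whose numeric argument is
a number of I/O TLB slabs rather than a size in bytes. Assuming 2 KiB
slabs, a target pool size converts as sketched below; mib_to_slabs() is
a made-up helper, not a kernel function.

```c
#include <stdint.h>

#define SLAB_BYTES 2048u  /* assumed slab size: 2 KiB */

/* Hypothetical helper: slabs needed for a pool of size_mib MiB. */
uint64_t mib_to_slabs(uint64_t size_mib)
{
	return size_mib * 1024u * 1024u / SLAB_BYTES;
}
```

With these assumptions the 64 MiB default corresponds to 32768 slabs,
and booting with swiotlb=8192 would shrink the pool to 16 MiB.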

I have tried to measure the expected performance degradation so
that I could reduce it and/or compare it to alternative approaches.
I have performed all tests on an otherwise idle Raspberry Pi 4 with
swiotlb=force (which, admittedly, is a bit artificial). I quickly
ran into trouble.

I ran fio against an ext3 filesystem mounted from a UAS drive. To
my surprise, forcing swiotlb (without my patches) *improved* IOPS
and bandwidth for 4K and 64K blocks by 3 to 7 percent, and made no
visible difference for 1M blocks. I also observed smaller minimum
and average completion latencies, and even smaller maximum
latencies for 4K blocks. However, when I ran the tests again later
to verify some oddities, there was a performance drop. It appears
that IOPS, bandwidth and latencies reported by two consecutive fio
runs may differ by as much as 10%, which exceeds the differences I
measured, so the results are invalid.
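
One way to quantify this kind of noise is to repeat each configuration
several times and compare the spread of the runs with the difference
being measured. The sketch below computes the relative spread (maximum
minus minimum, divided by the mean) of a set of samples. The function
name is hypothetical, and this is only one of several possible noise
estimates.

```c
#include <stddef.h>

/*
 * Relative spread of n samples: (max - min) / mean.
 * Returns 0.0 for an empty set or when the samples sum to zero.
 */
double relative_spread(const double *samples, size_t n)
{
	double min, max, sum = 0.0;
	size_t i;

	if (n == 0)
		return 0.0;

	min = max = samples[0];
	for (i = 0; i < n; i++) {
		if (samples[i] < min)
			min = samples[i];
		if (samples[i] > max)
			max = samples[i];
		sum += samples[i];
	}
	if (sum == 0.0)
		return 0.0;

	return (max - min) / (sum / (double)n);
}
```

If repeated baseline runs show a relative spread of around 0.10, a
difference of 3 to 7 percent between two configurations lies within the
noise and says nothing about either of them.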

I tried to make a micro-benchmark on dma_map_page_attrs() using the
bcc tool funclatency, but just loading the eBPF program was enough
to change the behaviour of the system wildly.

I wonder if anyone can give me advice on measuring SWIOTLB
performance. I can see that AMD, IBM and Microsoft people have
mentioned performance in their patches, but AFAICS without
explaining how it was measured. Knowing a bit more would be much
appreciated.

Petr Tesarik (4):
  dma-mapping: introduce the DMA_ATTR_MAY_SLEEP attribute
  swiotlb: Move code around in preparation for dynamic bounce buffers
  swiotlb: Allow dynamic allocation of bounce buffers
  swiotlb: Add an option to allow dynamic bounce buffers

 .../admin-guide/kernel-parameters.txt     |   6 +-
 Documentation/core-api/dma-attributes.rst |  10 +
 include/linux/dma-mapping.h               |   6 +
 include/linux/swiotlb.h                   |  17 +-
 kernel/dma/swiotlb.c                      | 233 +++++++++++++++---
 5 files changed, 241 insertions(+), 31 deletions(-)

--
2.25.1