Re: [PATCH 0/4] Add dynamic iommu backed bounce buffers

From: Robin Murphy
Date: Thu Jul 08 2021 - 13:14:34 EST


On 2021-07-08 10:29, Joerg Roedel wrote:
> Adding Robin too.
>
> On Wed, Jul 07, 2021 at 04:55:01PM +0900, David Stevens wrote:
>> Add support for per-domain dynamic pools of iommu bounce buffers to the
>> dma-iommu API. This allows iommu mappings to be reused while still
>> maintaining strict iommu protection. Allocating buffers dynamically
>> instead of using swiotlb carveouts makes per-domain pools better suited
>> to systems with large numbers of devices or where devices are unknown.

But isn't that just as true for the currently-supported case? All you need is a large enough Thunderbolt enclosure and you could suddenly plug in a dozen untrusted GPUs all wanting to map hundreds of megabytes of memory. If there's a real concern worth addressing, surely it's worth addressing properly for everyone.
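For concreteness, the per-domain pool being described might look roughly like the sketch below. This is purely illustrative (made-up names, not the actual io-buffer-pool.c layout from the series):

struct io_buffer_slot {
	struct list_head	list;
	void			*orig_buffer;	/* caller's buffer, for copy-in/out */
	struct page		**bounce_pages;	/* dynamically allocated bounce pages */
	dma_addr_t		iova;		/* stays mapped, reused across requests */
};

struct io_buffer_pool {
	struct iommu_domain	*domain;	/* one pool per domain */
	spinlock_t		lock;
	struct list_head	free_slots;	/* grown on demand, not carved out at boot */
	struct list_head	used_slots;
	size_t			max_bounce_size; /* mappings above this bypass the pool */
};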

>> When enabled, all non-direct streaming mappings below a configurable
>> size will go through bounce buffers. Note that this means drivers which
>> don't properly use the DMA API (e.g. i915) cannot use an iommu when this
>> feature is enabled. However, all drivers which work with swiotlb=force
>> should work.
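The "properly use the DMA API" caveat is the same one swiotlb=force imposes: a bounce implementation can only copy data at the map/sync/unmap calls, so drivers must not touch the buffer while it is mapped for the device or bypass those calls. A minimal sketch of the expected pattern, using the generic DMA API (dev, buf and len are placeholders):

	dma_addr_t dma = dma_map_single(dev, buf, len, DMA_FROM_DEVICE);

	if (dma_mapping_error(dev, dma))
		return -ENOMEM;

	/* ... hand 'dma' to the device and wait for the transfer ... */

	dma_unmap_single(dev, dma, len, DMA_FROM_DEVICE);
	/* only after unmap (or dma_sync_single_for_cpu()) is 'buf'
	 * guaranteed to contain the data the device wrote */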

>> Bounce buffers serve as an optimization in situations where interactions
>> with the iommu are very costly. For example, virtio-iommu operations in
>> a guest on a linux host require a vmexit, involvement of the VMM, and a
>> VFIO syscall. For relatively small DMA operations, memcpy can be
>> significantly faster.

Yup, back when the bounce-buffering stuff first came up I know networking folks were interested in terms of latency for small packets - virtualised IOMMUs are indeed another interesting case I hadn't thought of. It's definitely been on the radar as another use-case we'd like to accommodate with the bounce-buffering scheme.

However, that's the thing: bouncing is bouncing, and however you look at it, it still overlaps so much with the untrusted case - there's no reason that couldn't use pre-mapped bounce buffers too, for instance - that the only necessary difference is really the policy decision of when to bounce. iommu-dma has already grown complicated enough, and having *three* different ways of doing things internally just seems bonkers and untenable. Pre-map the bounce buffers? Absolutely. Dynamically grow them on demand? Yes please! Do it all as a special thing in its own NIH module and leave the existing mess to rot? Sorry, but no.
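To put that in concrete terms (a rough sketch only, not code from the series): the untrusted-device case and the expensive-IOMMU case could share a single decision point, with everything downstream of it - allocating, pre-mapping and recycling the bounce pages - being common. iommu_bounce_size_threshold() here is a made-up knob:

static bool iommu_dma_should_bounce(struct device *dev, size_t size)
{
	/* isolation: always bounce for untrusted (e.g. external PCI) devices */
	if (dev_is_untrusted(dev))
		return true;

	/* performance: bounce small mappings where (un)map operations are costly */
	return size <= iommu_bounce_size_threshold(dev);
}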

Thanks,
Robin.

>> As a performance comparison, on a device with an i5-10210U, I ran fio
>> with a VFIO passthrough NVMe drive with '--direct=1 --rw=read
>> --ioengine=libaio --iodepth=64' and block sizes 4k, 16k, 64k, and
>> 128k. Test throughput increased by 2.8x, 4.7x, 3.6x, and 3.6x. Time
>> spent in iommu_dma_unmap_(page|sg) per GB processed decreased by 97%,
>> 94%, 90%, and 87%. Time spent in iommu_dma_map_(page|sg) decreased
>> by >99%, as bounce buffers don't require syncing here in the read case.
>> Running with multiple jobs doesn't serve as a useful performance
>> comparison because virtio-iommu and vfio_iommu_type1 both have big
>> locks that significantly limit multithreaded DMA performance.
>>
>> This patch set is based on v5.13-rc7 plus the patches at [1].
>>
>> David Stevens (4):
>>   dma-iommu: add kalloc gfp flag to alloc helper
>>   dma-iommu: replace device arguments
>>   dma-iommu: expose a few helper functions to module
>>   dma-iommu: Add iommu bounce buffers to dma-iommu api
>>
>>  drivers/iommu/Kconfig          |  10 +
>>  drivers/iommu/Makefile         |   1 +
>>  drivers/iommu/dma-iommu.c      | 119 ++++--
>>  drivers/iommu/io-buffer-pool.c | 656 +++++++++++++++++++++++++++++++++
>>  drivers/iommu/io-buffer-pool.h |  91 +++++
>>  include/linux/dma-iommu.h      |  12 +
>>  6 files changed, 861 insertions(+), 28 deletions(-)
>>  create mode 100644 drivers/iommu/io-buffer-pool.c
>>  create mode 100644 drivers/iommu/io-buffer-pool.h
>>
>> --
>> 2.32.0.93.g670b81a890-goog