Re: [PATCH v7 00/15] Optimizing iommu_[map/unmap] performance

From: chenxiang (M)
Date: Wed Jul 14 2021 - 21:51:19 EST




On 2021/7/15 9:23, Lu Baolu wrote:
On 7/14/21 10:24 PM, Georgi Djakov wrote:
On 16.06.21 16:38, Georgi Djakov wrote:
When unmapping a buffer from an IOMMU domain, the IOMMU framework unmaps
the buffer at a granule of the largest page size that is supported by
the IOMMU hardware and fits within the buffer. For every block that
is unmapped, the IOMMU framework calls into the IOMMU driver, and then
into the io-pgtable framework, which walks the page tables to find the
entry that corresponds to the IOVA and then unmaps it.
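
To make the call pattern concrete, here is a rough sketch of that loop;
it is a simplification of the core __iommu_unmap() path, not the exact
upstream code:

/*
 * Simplified sketch: one indirect call into the driver (and from
 * there into io-pgtable) per unmapped block.
 */
static size_t sketch_unmap(struct iommu_domain *domain,
			   unsigned long iova, size_t size)
{
	const struct iommu_ops *ops = domain->ops;
	size_t unmapped = 0;

	while (unmapped < size) {
		/* Largest supported block that fits and is aligned. */
		size_t pgsize = iommu_pgsize(domain, iova, size - unmapped);
		size_t len;

		/* One driver call and one page-table walk per block. */
		len = ops->unmap(domain, iova, pgsize, NULL);
		if (!len)
			break;

		iova += len;
		unmapped += len;
	}

	return unmapped;
}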

This can be suboptimal in scenarios where a buffer, or a piece of a
buffer, can be split into several contiguous page blocks of the same
size. For example, consider an IOMMU that supports 4 KB, 2 MB and 1 GB
page blocks, and a 4 MB buffer being unmapped at IOVA 0. The current
call-flow results in 4 indirect calls (two per block: one into the
IOMMU driver and one into io-pgtable) and 2 page table walks, to unmap
2 entries that sit next to each other in the page tables, when both
entries could have been unmapped in one shot by clearing both page
table entries in the same call.
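
The block-size choice in that example follows a selection rule along
these lines (a minimal sketch loosely modelled on the iommu_pgsize()
helper; GENMASK/__fls/__ffs/BIT come from linux/bits.h and
linux/bitops.h):

/*
 * Pick the largest supported block that both fits in the remaining
 * size and matches the alignment of the IOVA.  For the 4 MB buffer
 * at IOVA 0 above, with a 4K|2M|1G pgsize_bitmap, this picks a 2 MB
 * block twice in a row.
 */
static size_t pick_pgsize(unsigned long pgsize_bitmap,
			  unsigned long iova, size_t size)
{
	unsigned long pgsizes = pgsize_bitmap & GENMASK(__fls(size), 0);

	/* The IOVA must be aligned to the block size. */
	if (iova)
		pgsizes &= GENMASK(__ffs(iova), 0);

	return BIT(__fls(pgsizes));
}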

The same optimization is applicable to mapping buffers as well, so
these patches add a set of callbacks, map_pages and unmap_pages, to
the io-pgtable code and the IOMMU drivers. They map or unmap an IOVA
range that consists of a number of pages of the same IOMMU-supported
page size, which allows multiple page table entries to be manipulated
within the same set of indirect calls. The reason for introducing new
callbacks, rather than changing the existing ones, is to give other
IOMMU drivers/io-pgtable formats time to switch over, so that the
transition to this approach can be done piecemeal.
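
For reference, the new ops take a page size plus a page count; their
shape is roughly the following (a sketch; the patches themselves are
authoritative for the exact signatures):

/* Approximate shape of the new iommu_ops callbacks. */
int (*map_pages)(struct iommu_domain *domain, unsigned long iova,
		 phys_addr_t paddr, size_t pgsize, size_t pgcount,
		 int prot, gfp_t gfp, size_t *mapped);

size_t (*unmap_pages)(struct iommu_domain *domain, unsigned long iova,
		      size_t pgsize, size_t pgcount,
		      struct iommu_iotlb_gather *iotlb_gather);

A driver implementing these can clear pgcount entries of size pgsize
in a single call, instead of being called once per block.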

Hi Will,

Did you get a chance to look at this patchset? Most patches are
already acked/reviewed, and the series still applies cleanly on rc1.

I also have the ops->[un]map_pages implementation for the Intel IOMMU
driver. I will post it once the iommu/core part gets applied.

I have also implemented those callbacks for ARM SMMUv3 on top of this
series and used dma_map_benchmark to measure the map/unmap latency, as
shown below. The optimization improves the latency considerably. I plan
to post the ARM SMMUv3 implementation after this series is applied.
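
The numbers below were gathered with invocations roughly like the
following (assuming the dma_map_benchmark selftest with its multi-page
granule option; -t is the thread count, -g the number of 4K pages per
map):

# built from tools/testing/selftests/dma/
./dma_map_benchmark -t 1 -g 16     # 1 thread, 16 x 4K = 64K per map
./dma_map_benchmark -t 10 -g 256   # 10 threads, 1M per map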

t = 1 (threads = 1), each cell is map/unmap latency:

                    before opt (us)   after opt (us)
g=1    (4K size)        0.1/1.3           0.1/0.8
g=2    (8K size)        0.2/1.5           0.2/0.9
g=4    (16K size)       0.3/1.9           0.1/1.1
g=8    (32K size)       0.5/2.7           0.2/1.4
g=16   (64K size)       1.0/4.5           0.2/2.0
g=32   (128K size)      1.8/7.9           0.2/3.3
g=64   (256K size)      3.7/14.8          0.4/6.1
g=128  (512K size)      7.1/14.7          0.5/10.4
g=256  (1M size)       14.0/53.9          0.8/19.3
g=512  (2M size)        0.2/0.9           0.2/0.9
g=1024 (4M size)        0.5/1.5           0.4/1.0

t = 10 (threads = 10), each cell is map/unmap latency:

                    before opt (us)   after opt (us)
g=1    (4K size)        0.3/7.0           0.1/5.8
g=2    (8K size)        0.4/6.7           0.3/6.0
g=4    (16K size)       0.5/6.3           0.3/5.6
g=8    (32K size)       0.5/8.3           0.2/6.3
g=16   (64K size)       1.0/17.3          0.3/12.4
g=32   (128K size)      1.8/36.0          0.2/24.2
g=64   (256K size)      4.3/67.2          1.2/46.4
g=128  (512K size)      7.8/93.7          1.3/94.2
g=256  (1M size)       14.7/280.8         1.8/191.5
g=512  (2M size)        3.6/3.2           1.5/2.5
g=1024 (4M size)        2.0/3.1           1.8/2.6



Best regards,
baolu