Re: [PATCH net] xsk: remove cheap_dma optimization

From: Björn Töpel
Date: Sun Jun 28 2020 - 13:16:43 EST



On 2020-06-27 09:04, Christoph Hellwig wrote:
> On Sat, Jun 27, 2020 at 01:00:19AM +0200, Daniel Borkmann wrote:
>> Given there is roughly a ~5 weeks window at max where this removal could
>> still be applied in the worst case, could we come up with a fix / proposal
>> first that moves this into the DMA mapping core? If there is something that
>> can be agreed upon by all parties, then we could avoid re-adding the 9%
>> slowdown. :/
>
> I'd rather turn it upside down - this abuse of the internals blocks work
> that has basically just missed the previous window and I'm not going
> to wait weeks to sort out the API misuse. But we can add optimizations
> back later if we find a sane way.


I'm not super excited about the performance loss, but I do get
Christoph's frustration about the gutting of the DMA API, which makes it
harder for DMA people to get work done. Let's try to solve this properly
using proper DMA APIs.


> That being said I really can't see how this would make so much of a
> difference. What architecture and what dma_ops are you using for
> those measurements? What is the workload?


The 9% is for an AF_XDP (fast raw Ethernet socket; think AF_PACKET, but
faster) benchmark: receive the packet from the NIC, and drop it. The DMA
syncs stand out in perf top:

28.63% [kernel] [k] i40e_clean_rx_irq_zc
17.12% [kernel] [k] xp_alloc
8.80% [kernel] [k] __xsk_rcv_zc
7.69% [kernel] [k] xdp_do_redirect
5.35% bpf_prog_992d9ddc835e5629 [k] bpf_prog_992d9ddc835e5629
4.77% [kernel] [k] xsk_rcv.part.0
4.07% [kernel] [k] __xsk_map_redirect
3.80% [kernel] [k] dma_direct_sync_single_for_cpu
3.03% [kernel] [k] dma_direct_sync_single_for_device
2.76% [kernel] [k] i40e_alloc_rx_buffers_zc
1.83% [kernel] [k] xsk_flush
...

For this benchmark the dma_ops are NULL (dma_is_direct() == true), and
the main issue is that SWIOTLB is now unconditionally enabled [1] for
x86, so for each sync we have to check whether the buffer is an SWIOTLB
buffer (is_swiotlb_buffer()), which involves some costly indirection.

That was pretty much what my hack avoided. Instead we did all the checks
upfront, since AF_XDP has long-term DMA mappings, and just set a flag
for that.

Avoiding the whole "is this address swiotlb" check in
dma_direct_sync_single_for_{cpu,device}() for every packet
would help a lot.
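
Just to illustrate the shape of it, here is a minimal sketch of the
"check once at setup, then flag it" idea. This is not the removed code;
dma_mapping_no_sync_needed() below is a made-up helper standing in for
whatever the DMA core could expose for long-lived mappings:

#include <linux/dma-mapping.h>

/* Sketch only: decide once, at umem mapping time, whether syncs are
 * actually needed for this long-lived mapping, instead of re-deciding
 * (and calling into is_swiotlb_buffer()) for every single packet.
 */
struct xp_dma_state {
	struct device *dev;
	bool need_sync;		/* decided once at mapping time */
};

static void xp_dma_state_init(struct xp_dma_state *st, struct device *dev,
			      dma_addr_t addr)
{
	st->dev = dev;
	/* dma_mapping_no_sync_needed() is hypothetical: "direct mapping,
	 * coherent device, no SWIOTLB bouncing for this address".
	 */
	st->need_sync = !dma_mapping_no_sync_needed(dev, addr);
}

static inline void xp_dma_sync_for_cpu(struct xp_dma_state *st,
				       dma_addr_t addr, size_t size)
{
	if (st->need_sync)
		dma_sync_single_for_cpu(st->dev, addr, size,
					DMA_FROM_DEVICE);
}

With something like that exposed by the mapping core, the per-packet
path is a single flag test instead of a walk through the DMA internals.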

Somewhat related to the DMA API: it would have performance benefits for
AF_XDP if the DMA range of the mapped memory was linear, i.e. contiguous,
e.g. via the IOMMU. I've started hacking on something a little bit, but
it would be nice if such an API was part of the mapping core.

Input: array of pages
Output: array of dma addrs (and obviously dev, flags and such)

For non-IOMMU: len(array of pages) == len(array of dma addrs)
For best-case IOMMU: len(array of dma addrs) == 1 (one large linear space)
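
To sketch what I mean (made-up names, not an existing DMA API):

#include <linux/dma-mapping.h>

/* Hypothetical batch-mapping interface, just to illustrate the idea:
 * hand the DMA core an array of pages and get back DMA address ranges.
 * An IOMMU implementation could merge everything into one linear IOVA
 * range (*nr_ranges == 1); dma-direct would return one range per page.
 */
struct dma_page_range {
	dma_addr_t addr;
	size_t len;
};

int dma_map_page_array(struct device *dev, struct page **pages,
		       unsigned int nr_pages,
		       enum dma_data_direction dir, unsigned long attrs,
		       struct dma_page_range *ranges,
		       unsigned int *nr_ranges);

AF_XDP could then address the whole pool with one base address plus an
offset in the best case, instead of a per-page lookup.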

But that's for later. :-)


Björn


[1] commit: 09230cbc1bab ("swiotlb: move the SWIOTLB config symbol to lib/Kconfig")