Re: [PATCH net-next v1 1/6] lan743x: boost performance on cpu archs w/o dma cache snooping

From: Jakub Kicinski
Date: Fri Jan 29 2021 - 17:02:41 EST


On Fri, 29 Jan 2021 14:52:35 -0500 Sven Van Asbroeck wrote:
> From: Sven Van Asbroeck <thesven73@xxxxxxxxx>
>
> The buffers in the lan743x driver's receive ring are always 9K,
> even when the largest packet that can be received (the mtu) is
> much smaller. This performs particularly badly on cpu archs
> without dma cache snooping (such as ARM): each received packet
> results in a 9K dma_{map|unmap} operation, which is very expensive
> because cpu caches need to be invalidated.
>
> Careful measurement of the driver rx path on armv7 reveals that
> the cpu spends the majority of its time waiting for cache
> invalidation.
>
> Optimize as follows:
>
> 1. Set the rx ring buffer size equal to the mtu. This limits the
> amount of cache that needs to be invalidated per dma_map().
>
> 2. When dma_unmap()ping, skip the cpu sync. Sync only the packet data
> actually received, the size of which the chip indicates in
> its rx ring descriptors. This limits the amount of cache that
> needs to be invalidated per dma_unmap().
>
> These optimizations double the rx performance on armv7.
> Third parties report 3x rx speedup on armv8.
>
> Performance on dma cache snooping architectures (such as x86)
> is expected to stay the same.
>
> Tested with iperf3 on a freescale imx6qp + lan7430, both sides
> set to mtu 1500 bytes, measure rx performance:
>
> Before:
> [ ID] Interval           Transfer     Bandwidth       Retr
> [  4]   0.00-20.00  sec   550 MBytes   231 Mbits/sec    0
> After:
> [ ID] Interval           Transfer     Bandwidth       Retr
> [  4]   0.00-20.00  sec  1.33 GBytes   570 Mbits/sec    0
>
> Test by Anders Roenningen (anders@xxxxxxxxxxxxxxxxx) on armv8,
> rx iperf3:
> Before 102 Mbits/sec
> After 279 Mbits/sec
>
> Signed-off-by: Sven Van Asbroeck <thesven73@xxxxxxxxx>

You may need to rebase to see this:

drivers/net/ethernet/microchip/lan743x_main.c:2123:41: warning: restricted __le32 degrades to integer
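
For context: sparse emits this warning when a value typed __le32 (for
example a descriptor field written by the device) is used in an integer
expression without first going through le32_to_cpu(). Below is a minimal
userspace sketch of the usual shape of the fix. Everything named demo_* or
DEMO_* is a hypothetical stand-in for illustration, not the lan743x
descriptor layout or the kernel's own helpers:

```c
#include <assert.h>
#include <stdint.h>

/* Userspace stand-in for the kernel's __le32. In the kernel, sparse
 * treats __le32 as a restricted type and warns on direct integer use. */
typedef uint32_t demo_le32;

/* Hypothetical rx descriptor with one little-endian status word. */
struct demo_rx_desc {
	demo_le32 data0;
};

/* Hypothetical mask for the frame-length bits of data0. */
#define DEMO_FRAME_LEN_MASK 0x3FFFu

/* Byte-wise equivalent of le32_to_cpu(); correct on any host endianness. */
static uint32_t demo_le32_to_cpu(demo_le32 v)
{
	const unsigned char *b = (const unsigned char *)&v;

	return (uint32_t)b[0] | ((uint32_t)b[1] << 8) |
	       ((uint32_t)b[2] << 16) | ((uint32_t)b[3] << 24);
}

/* Byte-wise equivalent of cpu_to_le32(). */
static demo_le32 demo_cpu_to_le32(uint32_t v)
{
	demo_le32 out;
	unsigned char *b = (unsigned char *)&out;

	b[0] = (unsigned char)(v & 0xFF);
	b[1] = (unsigned char)((v >> 8) & 0xFF);
	b[2] = (unsigned char)((v >> 16) & 0xFF);
	b[3] = (unsigned char)((v >> 24) & 0xFF);
	return out;
}

/* Convert to cpu byte order first, then mask. Writing
 * "desc->data0 & DEMO_FRAME_LEN_MASK" directly is the pattern
 * that sparse reports as "degrades to integer". */
static uint32_t demo_frame_len(const struct demo_rx_desc *desc)
{
	return demo_le32_to_cpu(desc->data0) & DEMO_FRAME_LEN_MASK;
}

/* Sanity check: the two conversions round-trip. */
static void demo_self_check(void)
{
	assert(demo_le32_to_cpu(demo_cpu_to_le32(0x12345678u)) == 0x12345678u);
}
```

In the driver itself the conversion would be the kernel's le32_to_cpu()
applied to the descriptor field before it is masked or compared.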