[PATCH net-next v1 1/6] lan743x: boost performance on cpu archs w/o dma cache snooping

From: Sven Van Asbroeck
Date: Fri Jan 29 2021 - 15:11:24 EST


From: Sven Van Asbroeck <thesven73@xxxxxxxxx>

The buffers in the lan743x driver's receive ring are always 9K,
even when the largest packet that can be received (the mtu) is
much smaller. This performs particularly badly on cpu archs
without dma cache snooping (such as ARM): every received packet
incurs a 9K dma_map and a 9K dma_unmap, each of which is very
expensive because the cpu caches covering the whole buffer must
be invalidated.

Careful measurement of the driver rx path on armv7 reveals that
the cpu spends the majority of its time waiting for cache
invalidation.

Optimize as follows:

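1. Size each rx ring buffer to hold an mtu-sized frame
(mtu + ETH_HLEN + 4 + RX_HEAD_PADDING bytes) instead of 9K. This
limits the amount of cache that needs to be invalidated per
dma_map(). Because buffer sizes now depend on the mtu, mtu
changes are refused with -EBUSY while the interface is running.

2. When calling dma_unmap(), skip the cpu sync, and instead sync
only the packet data actually received; the chip reports its
length in the rx ring descriptor. This limits the amount of cache
that needs to be invalidated per dma_unmap().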
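The per-packet cost of the two optimizations can be sketched with a
small userspace model that counts the cache lines invalidated for each
received frame. This is a simplified sketch under stated assumptions,
not driver code: the 32-byte cache line, the 9 KiB stand-in for the old
frame size, the 2-byte stand-in for RX_HEAD_PADDING, and the helpers
lines_for(), lines_before() and lines_after() are all hypothetical. Both
the map and the full unmap are modelled as invalidating the entire
buffer on a cpu without dma cache snooping.

```c
#include <assert.h>

/*
 * Simplified model of per-packet cache maintenance on a cpu without dma
 * cache snooping. All constants are assumptions for illustration only:
 *  - CACHE_LINE: assumed 32-byte lines; the real size is cpu-specific.
 *  - OLD_MAX_FRAME: assumed 9 KiB, standing in for LAN743X_MAX_FRAME_SIZE.
 *  - RX_PAD: assumed 2 bytes, standing in for RX_HEAD_PADDING.
 * Buffers are assumed to start on a cache-line boundary.
 */
#define CACHE_LINE    32u
#define OLD_MAX_FRAME (9u * 1024u)
#define RX_PAD        2u
#define ETH_HDR_LEN   14u /* same value as the kernel's ETH_HLEN */
#define EXTRA_LEN     4u  /* the "+ 4" in the driver's length formula */

/* Number of cache lines covering `bytes` bytes of a line-aligned buffer. */
unsigned int lines_for(unsigned int bytes)
{
	return (bytes + CACHE_LINE - 1) / CACHE_LINE;
}

/* Before: every buffer is 9K-sized, and both the map and the unmap
 * are modelled as invalidating the whole buffer. */
unsigned int lines_before(void)
{
	unsigned int buf = OLD_MAX_FRAME + ETH_HDR_LEN + EXTRA_LEN + RX_PAD;

	return 2 * lines_for(buf);
}

/* After: the map invalidates an mtu-sized buffer, the unmap skips the
 * cpu sync, and only the `pkt_len` received bytes are synced. */
unsigned int lines_after(unsigned int mtu, unsigned int pkt_len)
{
	unsigned int buf = mtu + ETH_HDR_LEN + EXTRA_LEN + RX_PAD;

	assert(pkt_len <= buf); /* a frame must fit in a single rx buffer */
	return lines_for(buf) + lines_for(pkt_len);
}
```

With these assumed values, lines_before() gives 578 lines per packet,
whereas lines_after(1500, 1518) gives 96 for a full-sized frame at mtu
1500 and lines_after(1500, 64) gives 50 for a minimum-sized frame. The
model ignores buffer alignment and all other per-packet work.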

These optimizations more than double the rx performance on armv7.
Third parties report a nearly 3x rx speedup on armv8.

Performance on dma cache snooping architectures (such as x86)
is expected to stay the same.

Tested with iperf3 on a Freescale i.MX6QP + LAN7430, with both
sides set to an mtu of 1500 bytes, measuring rx performance:

Before:
[ ID] Interval          Transfer     Bandwidth      Retr
[  4] 0.00-20.00 sec     550 MBytes  231 Mbits/sec  0
After:
[ ID] Interval          Transfer     Bandwidth      Retr
[  4] 0.00-20.00 sec    1.33 GBytes  570 Mbits/sec  0

Tested by Anders Roenningen (anders@xxxxxxxxxxxxxxxxx) on armv8,
rx iperf3:
Before: 102 Mbits/sec
After:  279 Mbits/sec
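The reported figures can be cross-checked with a little arithmetic.
The helpers below, speedup() and mbits_per_sec(), are hypothetical and
assume that iperf3 prints Transfer in binary units (1 MByte = 2^20
bytes) and Bandwidth in decimal units (1 Mbit = 10^6 bits).

```c
#include <assert.h>

/* Ratio of the "after" throughput to the "before" throughput;
 * both arguments must use the same units. */
double speedup(double before, double after)
{
	assert(before > 0.0);
	return after / before;
}

/* Average bandwidth in Mbits/sec for `mbytes` MBytes moved in `seconds`.
 * Assumes binary MBytes (2^20 bytes) and decimal Mbits (10^6 bits). */
double mbits_per_sec(double mbytes, double seconds)
{
	assert(seconds > 0.0);
	return mbytes * 1048576.0 * 8.0 / seconds / 1e6;
}
```

For example, speedup(231, 570) is about 2.47 on armv7 and
speedup(102, 279) is about 2.74 on armv8, while mbits_per_sec(550, 20)
is about 230.7, consistent with the 231 Mbits/sec that iperf3 reported
for the 550 MBytes transfer.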

Signed-off-by: Sven Van Asbroeck <thesven73@xxxxxxxxx>
---

Tree: git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next.git # 46eb3c108fe1

To: Bryan Whitehead <bryan.whitehead@xxxxxxxxxxxxx>
To: UNGLinuxDriver@xxxxxxxxxxxxx
To: "David S. Miller" <davem@xxxxxxxxxxxxx>
To: Jakub Kicinski <kuba@xxxxxxxxxx>
Cc: Andrew Lunn <andrew@xxxxxxx>
Cc: Alexey Denisov <rtgbnm@xxxxxxxxx>
Cc: Sergej Bauer <sbauer@xxxxxxxxxxx>
Cc: Tim Harvey <tharvey@xxxxxxxxxxxxx>
Cc: Anders Rønningen <anders@xxxxxxxxxxxxxxxxx>
Cc: netdev@xxxxxxxxxxxxxxx
Cc: linux-kernel@xxxxxxxxxxxxxxx (open list)

drivers/net/ethernet/microchip/lan743x_main.c | 35 ++++++++++++-------
1 file changed, 23 insertions(+), 12 deletions(-)

diff --git a/drivers/net/ethernet/microchip/lan743x_main.c b/drivers/net/ethernet/microchip/lan743x_main.c
index f1f6eba4ace4..f485320e5784 100644
--- a/drivers/net/ethernet/microchip/lan743x_main.c
+++ b/drivers/net/ethernet/microchip/lan743x_main.c
@@ -1957,11 +1957,11 @@ static int lan743x_rx_next_index(struct lan743x_rx *rx, int index)

 static struct sk_buff *lan743x_rx_allocate_skb(struct lan743x_rx *rx)
 {
-	int length = 0;
+	struct net_device *netdev = rx->adapter->netdev;

-	length = (LAN743X_MAX_FRAME_SIZE + ETH_HLEN + 4 + RX_HEAD_PADDING);
-	return __netdev_alloc_skb(rx->adapter->netdev,
-				  length, GFP_ATOMIC | GFP_DMA);
+	return __netdev_alloc_skb(netdev,
+				  netdev->mtu + ETH_HLEN + 4 + RX_HEAD_PADDING,
+				  GFP_ATOMIC | GFP_DMA);
 }

static void lan743x_rx_update_tail(struct lan743x_rx *rx, int index)
@@ -1977,9 +1977,10 @@ static int lan743x_rx_init_ring_element(struct lan743x_rx *rx, int index,
 {
 	struct lan743x_rx_buffer_info *buffer_info;
 	struct lan743x_rx_descriptor *descriptor;
-	int length = 0;
+	struct net_device *netdev = rx->adapter->netdev;
+	int length;

-	length = (LAN743X_MAX_FRAME_SIZE + ETH_HLEN + 4 + RX_HEAD_PADDING);
+	length = netdev->mtu + ETH_HLEN + 4 + RX_HEAD_PADDING;
 	descriptor = &rx->ring_cpu_ptr[index];
 	buffer_info = &rx->buffer_info[index];
 	buffer_info->skb = skb;
@@ -2148,11 +2149,18 @@ static int lan743x_rx_process_packet(struct lan743x_rx *rx)
 			descriptor = &rx->ring_cpu_ptr[first_index];

 			/* unmap from dma */
+			packet_length = RX_DESC_DATA0_FRAME_LENGTH_GET_
+					(descriptor->data0);
 			if (buffer_info->dma_ptr) {
-				dma_unmap_single(&rx->adapter->pdev->dev,
-						 buffer_info->dma_ptr,
-						 buffer_info->buffer_length,
-						 DMA_FROM_DEVICE);
+				dma_sync_single_for_cpu(&rx->adapter->pdev->dev,
+							buffer_info->dma_ptr,
+							packet_length,
+							DMA_FROM_DEVICE);
+				dma_unmap_single_attrs(&rx->adapter->pdev->dev,
+						       buffer_info->dma_ptr,
+						       buffer_info->buffer_length,
+						       DMA_FROM_DEVICE,
+						       DMA_ATTR_SKIP_CPU_SYNC);
 				buffer_info->dma_ptr = 0;
 				buffer_info->buffer_length = 0;
 			}
@@ -2167,8 +2175,8 @@ static int lan743x_rx_process_packet(struct lan743x_rx *rx)
 			int index = first_index;

 			/* multi buffer packet not supported */
-			/* this should not happen since
-			 * buffers are allocated to be at least jumbo size
+			/* this should not happen since buffers are allocated
+			 * to be at least the mtu size configured in the mac.
 			 */

 			/* clean up buffers */
@@ -2628,6 +2636,9 @@ static int lan743x_netdev_change_mtu(struct net_device *netdev, int new_mtu)
 	struct lan743x_adapter *adapter = netdev_priv(netdev);
 	int ret = 0;

+	if (netif_running(netdev))
+		return -EBUSY;
+
 	ret = lan743x_mac_set_mtu(adapter, new_mtu);
 	if (!ret)
 		netdev->mtu = new_mtu;
--
2.17.1