Re: Kernel Panic in skb_release_data using genet

From: Florian Fainelli
Date: Fri Jul 02 2021 - 12:49:37 EST


Hey Maxime,

On 6/25/2021 5:59 AM, Maxime Ripard wrote:
Hi Florian,

Sorry for the late reply

On Thu, Jun 10, 2021 at 02:33:17PM -0700, Florian Fainelli wrote:
On 6/2/2021 6:28 AM, Maxime Ripard wrote:
On Tue, Jun 01, 2021 at 11:33:18AM +0200, nicolas saenz julienne wrote:
On Mon, 2021-05-31 at 19:36 -0700, Florian Fainelli wrote:
That is also how I boot my Pi4 at home, and I suspect you are right, if
the VPU does not shut down GENET's DMA, and leaves buffer addresses in
the on-chip descriptors that point to an address space that is managed
totally differently by Linux, then we can have a serious problem and
create some memory corruption when the ring is being reclaimed. I will
run a few experiments to test that theory and there may be a solution
using the SW_INIT reset controller to have a big reset of the controller
before handing it over to the Linux driver.

Adding a WARN_ON(reg & DMA_EN) in bcmgenet_dma_disable() has not shown
that the TX or RX DMA have been left running during the hand over from
the VPU to the kernel. I checked out drm-misc-next-2021-05-17 to reduce
as much as possible the differences between your set-up and my set-up
but so far have not been able to reproduce the crash in booting from NFS
repeatedly, I will try again.

FWIW I can reproduce the error too. That said it's rather hard to reproduce,
something in the order of 1 failure every 20 tries.

Yeah, it looks like it's only from a cold boot and comes in "bursts",
where you would get like 5 in a row and be done with it for a while.

Here are two patches that you could try exclusive from one another

1) Limit GENET to a single queue

diff --git a/drivers/net/ethernet/broadcom/genet/bcmgenet.c
b/drivers/net/ethernet/broadcom/genet/bcmgenet.c
index fcca023f22e5..e400c12e6868 100644
--- a/drivers/net/ethernet/broadcom/genet/bcmgenet.c
+++ b/drivers/net/ethernet/broadcom/genet/bcmgenet.c
@@ -3652,6 +3652,12 @@ static int bcmgenet_change_carrier(struct
net_device *dev, bool new_carrier)
return 0;
}

+static u16 bcmgenet_select_queue(struct net_device *dev, struct sk_buff
*skb,
+ struct net_device *sb_dev)
+{
+ return 0;
+}
+
static const struct net_device_ops bcmgenet_netdev_ops = {
.ndo_open = bcmgenet_open,
.ndo_stop = bcmgenet_close,
@@ -3666,6 +3672,7 @@ static const struct net_device_ops
bcmgenet_netdev_ops = {
#endif
.ndo_get_stats = bcmgenet_get_stats,
.ndo_change_carrier = bcmgenet_change_carrier,
+ .ndo_select_queue = bcmgenet_select_queue,
};

/* Array of GENET hardware parameters/characteristics */

2) Ensure that all TX/RX queues are disabled upon DMA initialization

diff --git a/drivers/net/ethernet/broadcom/genet/bcmgenet.c
b/drivers/net/ethernet/broadcom/genet/bcmgenet.c
index fcca023f22e5..7f8a5996fbbb 100644
--- a/drivers/net/ethernet/broadcom/genet/bcmgenet.c
+++ b/drivers/net/ethernet/broadcom/genet/bcmgenet.c
@@ -3237,15 +3237,21 @@ static void bcmgenet_get_hw_addr(struct
bcmgenet_priv *priv,
/* Returns a reusable dma control register value */
static u32 bcmgenet_dma_disable(struct bcmgenet_priv *priv)
{
+ unsigned int i;
u32 reg;
u32 dma_ctrl;

/* disable DMA */
dma_ctrl = 1 << (DESC_INDEX + DMA_RING_BUF_EN_SHIFT) | DMA_EN;
+ for (i = 0; i < priv->hw_params->tx_queues; i++)
+ dma_ctrl |= (1 << (i + DMA_RING_BUF_EN_SHIFT));
reg = bcmgenet_tdma_readl(priv, DMA_CTRL);
reg &= ~dma_ctrl;
bcmgenet_tdma_writel(priv, reg, DMA_CTRL);

+ dma_ctrl = 1 << (DESC_INDEX + DMA_RING_BUF_EN_SHIFT) | DMA_EN;
+ for (i = 0; i < priv->hw_params->rx_queues; i++)
+ dma_ctrl |= (1 << (i + DMA_RING_BUF_EN_SHIFT));
reg = bcmgenet_rdma_readl(priv, DMA_CTRL);
reg &= ~dma_ctrl;
bcmgenet_rdma_writel(priv, reg, DMA_CTRL);

I had a bunch of issues popping up today so I took the occasion to test
those patches. The first one doesn't change anything, I still had the
crash occurring with it. With the second applied (in addition), it seems
like it's fixed. I'll keep testing and will let you know.

Did this patch survive more days of testing? I am tempted to send it regardless of your testing because it is a correctness issue that is being fixed. There is a global DMA enable bit which should "cut" any TX/RX queues, but still, for symmetry with other code paths all queues should be disabled.

Thanks!
--
Florian