Re: exit_mmap BUG_ON in 2.6.23 (and Add qdisc __NET_XMIT_STOLEN)

From: Sam Portolla
Date: Sat May 26 2012 - 01:28:09 EST

----- Original Message -----
From: Eric Dumazet <eric.dumazet@xxxxxxxxx>
To: Sam Portolla <samportolla@xxxxxxxxx>
Cc: Hugh Dickins <hughd@xxxxxxxxxx>; "kaber@xxxxxxxxx" <kaber@xxxxxxxxx>; "jarkao2@xxxxxxxxx" <jarkao2@xxxxxxxxx>; "davem@xxxxxxxxxxxxx" <davem@xxxxxxxxxxxxx>; "linux-kernel@xxxxxxxxxxxxxxx" <linux-kernel@xxxxxxxxxxxxxxx>
Sent: Friday, May 25, 2012 9:25 PM
Subject: Re: exit_mmap BUG_ON in 2.6.23 (and Add qdisc __NET_XMIT_STOLEN)

On Fri, 2012-05-25 at 17:28 -0700, Sam Portolla wrote:

Please don't top post on this list

>
> [please cc samPortolla@xxxxxxxxx on the replies; not a member of this
> mailer]
>
> Hi Hugh,
>
> Thank you!  It turns out our 2.6.23 kernel does not have this old
> patch. I am also adding Jarek, David and Patrick, who were involved in
> the fix below, for their insights:
>
>
> commit 378a2f090f7a478704a372a4869b8a9ac206234e
> Date:  Mon Aug 4 22:31:03 2008 -0700
> net_sched: Add qdisc __NET_XMIT_STOLEN flag
>
> In this failure case below, as well as some others, the ethernet
> driver printed a transmit timeout just before the crash.
>
> It seems that, since we don't have the above patch, the kernel qdisc Tx
> packet path for fragmented packets can be messed up and corrupt the
> skb it passes to drivers, which, in the historic case that led to the
> above fix, caused an skb NULL pointer dereference in the driver itself
> (which we also saw once).
>
> Jarek, David or Patrick,
>
> Could the lack of the above patch also cause the kernel to falsely
> detect transmit timeouts on various drivers, because it cannot properly
> keep track of packets transmitted? Can you please elaborate so a newbie
> like me can understand?
>
> Is the above commit the only one required for the kernel panic/skb
> NULL dereference driver issue, or are there further fixes from later on
> that can be backported to an older kernel (2.6.23 GNU/Linux x86_64)?
>

Transmit timeouts are caused by races in some network drivers.

The device stays in the XOFF state for too long (forever, as a matter of
fact, once the race has triggered).

Since 2.6.23 we have fixed a lot of them, but some races still exist.

Yes, thanks. I had looked at the kernel code and know how transmit
timeouts come about in normal cases. The driver specifies a timeout
period to the network layer, along with a callback function to call in
case of a Tx timeout, so that the driver can do its error handling,
which is typically to reset the driver (this also happened with the
BNX2 Linux driver our system uses).

Above, I had asked some specific questions about whether a known bug
with qdisc could stop the Tx queues to the device and thereby cause
traffic timeouts. Also, it seems from the email thread on the patch I
mentioned above that the qdisc issue can cause memory corruption, which
could then tie it in with the BUG_ON in exit_mmap() that Hugh had
previously commented on. I am hoping the engineers who fixed the qdisc
issue can comment on the former, and that Hugh can comment on the
BUG_ON again.

Regards.
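
To make the timeout mechanism described above concrete, here is a rough
userspace model of it. This is only a sketch under stated assumptions,
not kernel code: struct model_dev and every model_* function are
hypothetical names invented for illustration. Only the two ideas come
from the discussion above: a driver-supplied timeout period plus
callback, and a queue that is stuck in the stopped (XOFF) state.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for a network device (assumption:
 * this is NOT the kernel's struct net_device). */
struct model_dev {
    bool queue_stopped;            /* models the XOFF state */
    unsigned long trans_start;     /* tick of the last transmit */
    unsigned long watchdog_timeo;  /* timeout period set by the driver */
    void (*tx_timeout)(struct model_dev *dev); /* driver callback */
    int resets;                    /* number of resets performed */
};

/* Driver side: a typical reaction is to reset and restart the queue. */
static void model_driver_tx_timeout(struct model_dev *dev)
{
    dev->resets++;
    dev->queue_stopped = false;
}

/* The driver registers its timeout period and callback. */
static void model_dev_init(struct model_dev *dev, unsigned long timeo)
{
    assert(dev != NULL);
    dev->queue_stopped = false;
    dev->trans_start = 0;
    dev->watchdog_timeo = timeo;
    dev->tx_timeout = model_driver_tx_timeout;
    dev->resets = 0;
}

/* Transmit path: record the time; stop the queue if the ring is full. */
static void model_start_xmit(struct model_dev *dev, unsigned long now,
                             bool ring_full)
{
    assert(dev != NULL);
    dev->trans_start = now;
    if (ring_full)
        dev->queue_stopped = true;
}

/* Network-layer side: periodic check. Returns true if a transmit
 * timeout was reported to the driver on this tick. */
static bool model_watchdog_tick(struct model_dev *dev, unsigned long now)
{
    assert(dev != NULL);
    if (!dev->queue_stopped)
        return false;   /* queue is running, nothing to do */
    if (now - dev->trans_start <= dev->watchdog_timeo)
        return false;   /* stopped, but not for too long yet */
    if (dev->tx_timeout)
        dev->tx_timeout(dev);
    dev->trans_start = now;
    return true;
}
```

In this model, a timeout is reported only when the queue has been
stopped for longer than the driver's period. So anything that leaves
the queue stopped with nobody left to wake it would show up as a
transmit timeout, whatever the underlying cause was.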