Re: BUG: IPv6 stops working after a while, needs ip ne del command to reset

From: Eric Dumazet
Date: Wed Sep 01 2010 - 09:19:43 EST


On Wednesday, 01 September 2010 at 11:21 +0200, Thomas Habets wrote:
> I've continued this a bit off-list but thought I would summarize for the
> archives.
>
>
> Summary
> -------
> It looks like a firmware issue on the network card. When ILO is enabled it
> shares the first network card with the OS, and while it does so multicast
> is broken. When multicast is broken at the L2 level, IPv6 neighbor
> discovery breaks. Only eth0 breaks; eth1 is unaffected.
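
For readers of the archive, a minimal sketch of why broken L2 multicast takes
neighbor discovery down with it: ND solicitations are sent to the target's
solicited-node multicast group, which maps to a 33:33:xx:xx:xx:xx Ethernet
address that the NIC has to accept. The address 2001:db8::1:2:3 and the helper
names are illustrative assumptions, not taken from this thread.

```python
import ipaddress


def solicited_node_group(addr: str) -> ipaddress.IPv6Address:
    """Return the solicited-node multicast group (ff02::1:ffXX:XXXX) for addr."""
    # The group is ff02::1:ff00:0/104 plus the low 24 bits of the unicast address.
    low24 = int(ipaddress.IPv6Address(addr)) & 0xFFFFFF
    base = int(ipaddress.IPv6Address("ff02::1:ff00:0"))
    return ipaddress.IPv6Address(base | low24)


def multicast_mac(group) -> str:
    """Map an IPv6 multicast group to its Ethernet address (33:33 + low 32 bits)."""
    low32 = int(ipaddress.IPv6Address(group)) & 0xFFFFFFFF
    octets = [0x33, 0x33] + list(low32.to_bytes(4, "big"))
    return ":".join(f"{o:02x}" for o in octets)


if __name__ == "__main__":
    # Hypothetical host address, used only for illustration.
    group = solicited_node_group("2001:db8::1:2:3")
    print(group, multicast_mac(group))
    # All-nodes group that the report says goes unanswered.
    print("ff02::1", multicast_mac("ff02::1"))
```

If the card drops frames addressed to those MACs, solicitations never reach
the host, which would be consistent with promisc mode hiding the problem.
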
>
>
> System
> ------
> HP Proliant DL320 G5p
> Xeon 3GHz
> 1GB RAM
> Arch: amd64
> NIC: Broadcom Corporation NetXtreme BCM5715 Gigabit Ethernet (rev a3)
> Debian Lenny (5.0.5)
> Kernels: 2.6.35 mainline, 2.6.33.6
> Config: http://pastebin.com/raw.php?i=Y6S8iKW7
>
>
> Problem
> -------
> Buggy box will not answer IPv6 ND or ping to ff02::1. May work at some
> point in the boot process, but once box is fully booted it does not.
>
> If I run "clear ipv6 neighbors" on the neighboring Cisco router (or the
> entry times out), that router cannot re-acquire the neighborship with the
> buggy box. Instant IPv6 breakage until I do one of:
> * Turn on promisc mode long enough for IPv6 ND to do its thing
> * ip ne del <address of neighbor> on the buggy host.
>
>
> Workarounds
> -----------
> Either one of these will hide the problem:
> * Set promisc mode on interface (ip link set promisc on eth0) forever
> * Disable ILO
> * Use eth1 instead of eth0.
>
>
> Troubleshooting
> ---------------
> Got patch for kernel from Eric Dumazet (eric.dumazet@xxxxxxxxx) to output
> what MAC addresses are being subscribed to, and some registers from the
> card. Output is earlier in this thread, along with "ethtool -i eth0" and
> some other data.
>
> Managed to get the diagnostic tool[1] booting from a stick (no CD drive in
> the server), but did not set up memory (himem.sys etc.). b57udiag
> therefore failed due to insufficient memory at test "Group D. Driver
> Associated tests". The card is assumed to be OK anyway.
>
> Matt Carlson (mcarlson@xxxxxxxxxxxx) suspected firmware bug and asked me
> to try disabling ASF and/or IPMI using the diagnostic tool, but running
> "setasf -d" and "setipmi -d" inside "b57udiag -cmd" did not seem to stick
> across reboot. It stuck properly before reboot (confirmed with setasf -q).
> Also tried "b57udiag -u 0". Tried both C-A-D reboot and powercycling (by
> power cord).
>
> At boot Linux still said ASF[1] for eth0 and ASF[0] for eth1:
> tg3 0000:03:04.0: eth0: RXcsums[1] LinkChgREG[0] MIirq[0] ASF[1] TSOcap[1]
> tg3 0000:03:04.1: eth1: RXcsums[1] LinkChgREG[0] MIirq[0] ASF[0] TSOcap[1]
> (this output never changed throughout the process)
> ethtool -d eth1 | grep 0x047 did not change either.
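
The ASF flag in the tg3 boot lines above can be pulled out mechanically when
comparing boots. A minimal sketch, assuming the log line format shown in this
message; parse_tg3_flags is a hypothetical helper, not part of any tool
mentioned here.

```python
import re

# Matches tokens such as "ASF[1]" or "TSOcap[1]" in a tg3 probe line.
FLAG_RE = re.compile(r"(\w+)\[(\d)\]")
IFACE_RE = re.compile(r"\b(eth\d+):")


def parse_tg3_flags(line: str) -> dict:
    """Return the interface name and the NAME[0/1] flags found in a tg3 line."""
    match = IFACE_RE.search(line)
    iface = match.group(1) if match else None
    flags = {name: int(value) for name, value in FLAG_RE.findall(line)}
    return {"iface": iface, "flags": flags}


if __name__ == "__main__":
    # The two lines quoted in the report.
    lines = [
        "tg3 0000:03:04.0: eth0: RXcsums[1] LinkChgREG[0] MIirq[0] ASF[1] TSOcap[1]",
        "tg3 0000:03:04.1: eth1: RXcsums[1] LinkChgREG[0] MIirq[0] ASF[0] TSOcap[1]",
    ]
    for line in lines:
        parsed = parse_tg3_flags(line)
        print(parsed["iface"], "ASF =", parsed["flags"].get("ASF"))
```
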
>
> Then I disabled ILO and PXE in ILO bios and BIOS respectively. That fixed
> it. eth0 now works with multicast.
>
> I don't use ILO on this server so in this case that fixes it for me, but
> the bug is still there.
>
> At this point Matt thinks I should file a bug report with HP. I will
> attempt to do that.
>
> I have more detailed logs of what I did and when, and what the effect was.
>
>
> Related
> -------
> May be the same issue as this:
> https://bugs.launchpad.net/ubuntu/+source/linux/+bug/263260
> Which means it's the same with Ubuntu kernels 2.6.26.3, 2.6.26-5-generic
> and 2.6.27-2-generic, and mainline kernels 2.6.25, 2.6.26 and 2.6.27.
>
>
> [1] http://www.broadcom.com/support/ethernet_nic/netxtreme_server.php
>


Thanks a lot, Thomas, for this very detailed report!

