Re: via-rhine: NETDEV WATCHDOG: eth0: transmit timed out

From: Marco Colombo (marco@esi.it)
Date: Thu Jun 08 2000 - 06:06:23 EST


On Wed, 7 Jun 2000, Urban Widmark wrote:

> (patch for testing at the end. it shouldn't make anything worse so please
> test if you are getting these errors.)
>
> On Wed, 7 Jun 2000, Marco Colombo wrote:
>
> > On Tue, 6 Jun 2000, Urban Widmark wrote:
> [snip]
> > > and try and decode it (the first 2 lines are the current descriptors 0x00
> > > - 0x1f is rx, 0x20 - 0x3f is tx, same format as the rx_desc/tx_desc
>
> (This is of course wrong, it should be:
> 0x20 current rx
> 0x30 next rx
> 0x40 current tx
> 0x50 next tx)
>
>
> > VIA VT86C100A Rhine 10/100 chip registers at 0xa400
> > 0x000: c1ba5000 806c93e8 0000085a 4eff0000 00000000 00000000 075a7000 075a7120
> > 0x020: 80000400 00000600 0759d810 075a7010 80000000 00000600 015ff010 075a7020
> > 0x040: 80000000 00e085ea 01758c00 075a7130 80000000 00e082c6 01759200 075a7140
>          ^^^^^^^^
> > 0x060: 063e0878 017591c4 00000000 00061008 782d0100 00000080 00070000 00000000
>
> 80000000 means that the "owner" bit is set and means that this descriptor
> is owned by the card (ie that a transmit has been started). The next
> descriptor also has this set, so unless you were sending a lot this is
> probably the one with problem.
>
> If all of them were "unsent" because of collisions there should be an
> interrupt status bit set, but there isn't. Hmm, in your report from using 2.2
> you wrote that you got:
> eth0: Something Wicked happened! 001a.
> last message repeated 2 times
> is that just before it stops working?

Yes. There are some of them. These messages are usually harmless.
Sometimes it just happens, and there is a single (isolated) message.
But when you see a burst of them (usually 4-8, sometimes just 2,
sometimes more), then it's going to stop soon (with the transmit timeout
message repeated until you down the interface).
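
For reference, the owner-bit check described in the quoted text can be
sketched in plain userspace C. This is only an illustration, not driver
code: the struct name, the helper names and the four-word layout are
assumptions read off the register dump, and treating the low 11 bits of
the second word as the buffer length is a guess as well.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed layout: four 32-bit words per descriptor, as they appear in
 * the register dump (status, length/control, buffer, next). */
struct rhine_desc {
    uint32_t status;      /* bit 31 set = descriptor owned by the card */
    uint32_t desc_length; /* control flags plus buffer length */
    uint32_t addr;        /* bus address of the data buffer */
    uint32_t next_desc;   /* bus address of the next descriptor */
};

#define DESC_OWN      0x80000000u /* the "owner" bit */
#define DESC_LEN_MASK 0x000007ffu /* assumption: length in the low 11 bits */

/* True if the card still owns the descriptor (transmit not completed). */
bool desc_owned_by_chip(const struct rhine_desc *d)
{
    return (d->status & DESC_OWN) != 0;
}

/* Buffer length, under the length-mask assumption above. */
unsigned desc_buf_len(const struct rhine_desc *d)
{
    return d->desc_length & DESC_LEN_MASK;
}

/* "Current tx" (0x040) and "next tx" (0x050) words from the dump above. */
const struct rhine_desc dump_cur_tx  = { 0x80000000u, 0x00e085eau,
                                         0x01758c00u, 0x075a7130u };
const struct rhine_desc dump_next_tx = { 0x80000000u, 0x00e082c6u,
                                         0x01759200u, 0x075a7140u };
```

Both descriptors from the dump come out as owned by the card, which is
what you would expect if the transmitter stopped with packets still queued.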

> 001a is: transmit buffer underflow, packet transmission aborted because of
> excessive collision, packet transmitted with no errors.
> or IntrTxDone | IntrTxAbort | IntrTxUnderrun.
>
> With debug > 1 you should get "Transmitter underrun" messages too. Do you?

Yes, but very few of them. They seem to be unrelated.
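
To make the 001a decoding above concrete, here is a small userspace
sketch. The three bit values are assumptions taken from my reading of the
driver's interrupt status bits (they do add up to 001a, as described);
the function name decode_tx_intr() is made up for this example.

```c
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Assumed Tx-related interrupt status bits (0x02 | 0x08 | 0x10 = 0x1a). */
enum {
    INTR_TX_DONE     = 0x0002, /* packet transmitted with no errors */
    INTR_TX_ABORT    = 0x0008, /* aborted because of excessive collisions */
    INTR_TX_UNDERRUN = 0x0010  /* transmit buffer underflow */
};

/* Write the names of the Tx bits set in 'status' into buf, joined by
 * " | ". Returns buf; the result is empty if no Tx bit is set. */
char *decode_tx_intr(unsigned status, char *buf, size_t len)
{
    static const struct { unsigned bit; const char *name; } bits[] = {
        { INTR_TX_DONE,     "IntrTxDone" },
        { INTR_TX_ABORT,    "IntrTxAbort" },
        { INTR_TX_UNDERRUN, "IntrTxUnderrun" },
    };
    size_t i, used = 0;

    if (len == 0)
        return buf;
    buf[0] = '\0';
    for (i = 0; i < sizeof bits / sizeof bits[0]; i++) {
        int n;

        if (!(status & bits[i].bit))
            continue;
        n = snprintf(buf + used, len - used, "%s%s",
                     used ? " | " : "", bits[i].name);
        if (n < 0 || (size_t)n >= len - used)
            break; /* output truncated, stop here */
        used += (size_t)n;
    }
    return buf;
}
```

Calling decode_tx_intr(0x001a, buf, sizeof buf) with a 64-byte buffer
names all three bits, so every "Something Wicked" message with 001a
reports a done, an abort and an underrun at the same time.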

>
> > Output "B" is the same, but for registers:
> >
> > 0x000: c1ba5000 206c93e8 0000081a 4eff0000 00000000 00000000 075a7000 075a7100
> > 0x020: 80000400 00000600 0641f810 075a7010 80000000 00000600 0641f010 075a7020
> > 0x040: 00000000 00e08000 badf00d0 075a7140 00000000 00e08000 badf00d0 075a7140
> ^^^^^^^^
> > 0x060: 0778b16c 01758e4e 00000000 00061008 782d0100 00000080 00070000 00000000
>
> I don't know who has written badf00d0 here as buffer pointer ... (the
> driver writes to the rx ring on netdev_close/via_rhine_close) but they
> shouldn't matter since it's not being used. I believe the 'next' looks
> like the 'current' when it is idle, so that's ok too.
>
>
> > I believe "A" is from a stopped state, "B" a working one, but I can't
> > tell for sure. More to come when the K7V is up again.
>
> You could also try this, or some variant of this. The "tx_timeout" has a
> few "to do's".
>
> --- linux-2.4.0-test1/drivers/net/via-rhine.c Sat May 27 12:20:05 2000
> +++ linux/drivers/net/via-rhine.c Wed Jun 7 21:01:26 2000
> @@ -816,12 +816,20 @@
>
>         /* XXX Perhaps we should reinitialize the hardware here. */
>         dev->if_port = 0;
> +       writew(CmdReset, ioaddr + ChipCmd);
> +
> +       np->chip_cmd = CmdStart|CmdTxOn|CmdRxOn|CmdNoTxPoll;
> +       if (np->duplex_lock)
> +               np->chip_cmd |= CmdFDuplex;
> +       writew(np->chip_cmd, ioaddr + ChipCmd);
> +
>
>         /* Stop and restart the chip's Tx processes . */
>         /* XXX to do */
>
>         /* Trigger an immediate transmit demand. */
> -       /* XXX to do */
> +       writew(CmdTxDemand | np->chip_cmd, dev->base_addr + ChipCmd);
> +
>
>         dev->trans_start = jiffies;
>         np->stats.tx_errors++;
>
> If this fails you could try adding other bits of code that are run
> at startup ... the tx_timeout is almost the same in the 2.2 driver so you
> could test it there too (but it won't apply cleanly I think).

I've already done something very similar (basically copied code from
via_rhine_open() and via_rhine_close()). Here's what I did:

--- /usr/src/linux/drivers/net/via-rhine.c Sat May 13 17:19:21 2000
+++ via-rhine.c.tm2 Sat Jun 3 19:13:03 2000
@@ -819,9 +819,21 @@
 
        /* Stop and restart the chip's Tx processes . */
        /* XXX to do */
+       printk (KERN_DEBUG "%s: Sending CmdStop.\n",
+               dev->name);
+       writew(CmdStop, dev->base_addr + ChipCmd);
+       printk (KERN_DEBUG "%s: Sending CmdStart|CmdTxOn|CmdRxOn|CmdNoTxPoll.\n",
+               dev->name);
+       np->chip_cmd = CmdStart|CmdTxOn|CmdRxOn|CmdNoTxPoll;
+       if (np->duplex_lock)
+               np->chip_cmd |= CmdFDuplex;
+       writew(np->chip_cmd, dev->base_addr + ChipCmd);
 
        /* Trigger an immediate transmit demand. */
        /* XXX to do */
+       printk (KERN_DEBUG "%s: Triggering an immediate transmit demand.\n",
+               dev->name);
+       writew(CmdTxDemand | np->chip_cmd, dev->base_addr + ChipCmd);
 
        dev->trans_start = jiffies;
        np->stats.tx_errors++;

The CmdTxDemand part is the same.

I didn't send any CmdReset (I missed it B-)), but sent a CmdStop.
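
To show what actually ends up in the ChipCmd register with my variant,
here is a userspace sketch of the three writes (stop, restart, transmit
demand). The bit values are assumptions from my reading of the driver's
command bits, and restart_cmd() and recovery_sequence() are names made up
for this example; nothing here touches real hardware.

```c
#include <stdint.h>

/* Assumed ChipCmd bits; the names follow the driver, the values are my
 * reading of it and should be checked against via-rhine.c. */
enum chip_cmd_bits {
    CmdStart    = 0x0002,
    CmdStop     = 0x0004,
    CmdRxOn     = 0x0008,
    CmdTxOn     = 0x0010,
    CmdTxDemand = 0x0020,
    CmdFDuplex  = 0x0400,
    CmdNoTxPoll = 0x0800
};

/* Command word written to restart the chip, as in the patches above. */
uint16_t restart_cmd(int duplex_lock)
{
    uint16_t cmd = CmdStart | CmdTxOn | CmdRxOn | CmdNoTxPoll;

    if (duplex_lock)
        cmd |= CmdFDuplex;
    return cmd;
}

/* Fill seq[] with the values written to ChipCmd, in order:
 * stop, restart, immediate transmit demand. Returns the count. */
int recovery_sequence(int duplex_lock, uint16_t seq[3])
{
    uint16_t cmd = restart_cmd(duplex_lock);

    seq[0] = CmdStop;
    seq[1] = cmd;
    seq[2] = (uint16_t)(CmdTxDemand | cmd);
    return 3;
}
```

Under these assumptions the half-duplex restart word is 0x081a, which
matches the third word (0000081a) in the first line of output "B", if
that word is indeed the ChipCmd register.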

Yes, the card recovers (I'll try your patch, just let me complete the tests
with the GA-7IXE). But it's almost useless. If I keep doing the "color
picker trick", it stops for 4-5 seconds. tcpdump shows that the card
recovers in about a second, but it transmits the wrong packets in the
TCP window, and it takes a few seconds more to send the first one. After
that, the host running the X server ACKs the whole window, as expected.
It then recovers, works for a few seconds, and (if I keep moving the
pointer) stops again. That is useful, since you don't end up isolated,
but no good for a network server...

I'd like to know what causes the tx timeout. With our patches (I think the
idea is the same) we're just putting that "ifdown eth0; ifup eth0"
sequence into the driver. That is OK if the event is *very* rare, but
it is not a solution to the problem, since it happens after a few seconds
of a certain kind of (normal) activity. What is strange is that:

on the K7V:
- it happens while the X server is running on Linux/Sparc, with 10Mbps eth;
- it happens while the X server is running on Linux/i386, with 100Mbps eth
  (the other card is a DFE530TX, the switch is a D-Link 10/100);
- it does NOT happen while the X server is running on Solaris/Sparc,
  with 10Mbps eth;

So I guess it's not just overloading the card on tx, because it should
handle any 10Mbps traffic easily, I think. The fact that it happens only
while talking to another Linux box may indicate a software problem, but:

on the GA-7IXE:
- NEVER happens.

And of course the driver is the same. BTW, during the tests I did months
ago (same card, MB was an Asus P5A), I also tried FreeBSD, which hung
(I can't remember if it crashed, froze, or just had the card stop, but
I remember it was a failure). So it can't be (only) a driver bug, I think.

On the K7V, I've also played a little with setpci:

With lspci I saw:
# lspci -d 1106:3043 -vv | grep Latency
        Latency: 118 min, 152 max, 64 set, cache line size 08

which is below the min, so I tried to raise it (setpci takes hex values,
so 80 is 128 decimal):

# setpci -d 1106:3043 latency_timer=80

# lspci -d 1106:3043 -vv | grep Latency
        Latency: 118 min, 152 max, 128 set, cache line size 08

But nothing changed.

Soon I'll get some other network cards (3c905tx, KNE100TX), and I'll test
them on both MBs... and I'll collect more structured data on the K7V & DFE530
combo. Later I'll get a MS-6167.

>
> /Urban
>
>

.TM.

-- 
      ____/  ____/   /
     /      /       /			Marco Colombo
    ___/  ___  /   /		      Technical Manager
   /          /   /			 ESI s.r.l.
 _____/ _____/  _/		       Colombo@ESI.it
