Re: TG3 data corruption (TSO ?)

From: Segher Boessenkool
Date: Fri Sep 08 2006 - 15:29:07 EST


I've been chasing with Segher a data corruption problem lately.
Basically transferring huge amount of data (several Gb) and I get
corrupted data at the rx side. I cannot tell for sure wether what I've
been observing here is the same problem that segher's been seing on is
blades, he will confirm or not. He also seemed to imply that reverting
to an older kernel on the -receiver- side fixed it, which makes me
wonder, since it's looks really like a sending side problem (see
explanation below), if some change in, for exmaple, window scaling,
might hide or trigger it.

Please send me lspci and tg3 probing output so that I know what
tg3 hardware you're using.

I use a 5780 rev. A3, but the problem is not limited to this chip.

I also want to look at the tcpdump or
ethereal on the mirrored port that shows the packet being corrupted.

I don't have such, sorry.

That's all the data I have at this point. I can't guarantee 100% that
it's a TSO bug (it might be a bug that TSO renders visible
due to timing
effects) but it looks like it since I've not reproduced yet with TSO
disabled.

It seems to indeed to only be exposed by TSO, not actually a
bug of it /an sich/.

I've got a patch that seems so solve the problem, it needs more testing
though (maybe Ben can do this :-) ). The problem is that there should
be quite a few wmb()'s in the code that are just not there; adding some
to tg3_set_txd() seems to fix the immediate problem but more is needed
(and I don't see why those should be needed, unless tg3_set_txd() is
updating a life ring entry in place or something like that).

More testing is needed, but the problem is definitely the lack of memory
ordering.


Segher

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/