Re: File corruption when using kernels 2.6.18+

From: Linus Torvalds
Date: Wed Oct 03 2007 - 23:40:37 EST




On Wed, 3 Oct 2007, Robert Hancock wrote:
>
> Erratum 97: 128-Bit Streaming Stores May Cause Coherency Failure

The Intel-optimized memcpy doesn't use the SSE registers, just regular
32-bit integer nontemporal stores (movnti). The reason is that the SSE
state save is too expensive to be worth it.

So it's not that. Also, considering that it was a single-bit error in all
the cases I saw, I wouldn't expect it to be a cache coherency problem,
which I'd expect to corrupt a whole cacheline, or at least a whole
access.
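
To make that distinction concrete, here is a minimal Python sketch (not
from the original thread) that compares a known-good copy of some data
against a corrupted copy and reports whether the damage looks like a
single flipped bit or something wider. The function names and the 64-byte
cacheline size are assumptions made purely for illustration.

```python
def diff_bits(good: bytes, bad: bytes):
    """Return (byte_offset, bit_index) for every bit that differs."""
    if len(good) != len(bad):
        raise ValueError("buffers must have the same length")
    flips = []
    for offset, (g, b) in enumerate(zip(good, bad)):
        delta = g ^ b  # set bits mark the positions that changed
        for bit in range(8):
            if delta & (1 << bit):
                flips.append((offset, bit))
    return flips


def classify(good: bytes, bad: bytes, cacheline: int = 64) -> str:
    """Rough label for the kind of corruption seen (cacheline size assumed)."""
    flips = diff_bits(good, bad)
    if not flips:
        return "identical"
    if len(flips) == 1:
        return "single-bit"
    # Which cachelines (by offset into the buffer) contain damaged bits?
    lines = {offset // cacheline for offset, _ in flips}
    if len(lines) == 1:
        return "multi-bit, one cacheline"
    return "multi-bit, several cachelines"
```

A lone flipped bit points more towards signal integrity than towards a
coherency failure, which would be expected to replace a whole access or
cacheline with stale data.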

That said, bit corruption can be just about anything. It's certainly not
impossible that it's a CPU bug.

But my first guess would be a slightly dodgy motherboard, possibly coupled
with a chipset that simply isn't very tolerant of timing errors. If
the motherboard traces to the DDR aren't impedance-matched, or if the
traces don't have the same length, or if the capacitors that are supposed
to handle spikes in burst current aren't up to snuff, you'll just get
noisy lines.

And at some point, noisy lines mean that you go from reliable operation
to "oh, that bit didn't make it correctly".

Lowering the front-side bus frequency or relaxing the memory timings can
help (i.e. doing things like running DDR-333 at DDR-266). Making sure that
your power supply isn't even close to its limits is good. And choosing a
motherboard and chipset from a reliable manufacturer is more than a good
idea.

The reason why it's interesting that the errors seemed to happen in the
same byte-lane is that I think it's common policy to route data lines on
the same layer, and matching trace length per group is very important,
because you do signal clocking per-group, afaik. But on the other hand,
multiple layers on the board are expensive, so people try to minimize
them, and maybe you end up routing through a via to another layer - which
then makes timing and capacitance harder.
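
One way to check the byte-lane observation from a list of corrupted
offsets is sketched below in Python. It is only an illustration: it
assumes a 64-bit data bus (8 byte lanes) and that the low three bits of
the offset match the low three bits of the physical address, which should
hold for page-aligned data but ignores things like channel interleaving.
The helper names are made up for the example.

```python
from collections import Counter

BUS_BYTES = 8  # assumption: 64-bit memory data bus, i.e. 8 byte lanes


def lane_histogram(corrupt_offsets, bus_bytes=BUS_BYTES):
    """Count how many corrupted bytes fall into each byte lane."""
    return Counter(offset % bus_bytes for offset in corrupt_offsets)


def dominant_lane(corrupt_offsets, bus_bytes=BUS_BYTES):
    """Return (lane, share_of_errors) for the most-hit lane, or None."""
    hist = lane_histogram(corrupt_offsets, bus_bytes)
    if not hist:
        return None
    lane, count = hist.most_common(1)[0]
    return lane, count / sum(hist.values())
```

If nearly all of the errors land in one lane, that is at least consistent
with one group of traces being marginal, rather than with a software bug
that would not care about bus layout.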

Or there aren't ground lines close enough, or the data lines are too close
to other lines and you get cross-talk etc etc.

No, I've not done board design, and I don't know what I'm talking about,
but look at the interesting zig-zagging the data (and address) lines often
do on the board. It often looks totally crazy ("why doesn't that line just
go straight?"), but the thing is that the groups all need to have the same
length, but the pins are all at different points, so you can't make the
lines straight, or some of them would be much shorter than others.

And if something is border-line, it may work all of the time - *until* you
hit specific patterns that cause lots of lines to wiggle around, and then
a capacitor won't handle the extra current draw from switching, or
cross-talk between lines hits you, and what used to work doesn't work any
more.
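
The kind of pattern that makes lots of lines wiggle at once can be
sketched in Python as below. This only illustrates the idea behind
memtest-style patterns: a userspace Python loop goes through the CPU
caches and will not stress the memory bus the way a real memory tester
does. The function names and the 8-byte word width are assumptions for
the example.

```python
def alternating(word_a: int, word_b: int, nwords: int, width: int = 8) -> bytes:
    """Build nwords bus-width words alternating between word_a and word_b."""
    out = bytearray()
    for i in range(nwords):
        word = word_a if i % 2 == 0 else word_b
        out += word.to_bytes(width, "little")
    return bytes(out)


def toggles(pattern: bytes, width: int = 8) -> int:
    """Count bit transitions between consecutive bus-width words."""
    words = [
        int.from_bytes(pattern[i:i + width], "little")
        for i in range(0, len(pattern), width)
    ]
    return sum(bin(a ^ b).count("1") for a, b in zip(words, words[1:]))


def write_and_verify(pattern: bytes) -> list:
    """Copy the pattern into a fresh buffer and list mismatching offsets."""
    buf = bytearray(len(pattern))
    buf[:] = pattern
    return [i for i, (w, r) in enumerate(zip(pattern, buf)) if w != r]
```

Alternating all-zeros with all-ones, or 0x55.. with 0xAA.., flips every
line on every word, which is the worst case for switching current and
cross-talk. A pattern of identical words flips nothing and is far less
likely to expose a border-line board.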

I wish we all had ECC memory. That gets rid of a lot of worries.

Linus