Re: Silent corruption on AMD64

From: Jim Paris
Date: Sat Mar 31 2007 - 23:22:35 EST


Aaron Lehmann wrote:
> I discovered a reproducible way of causing silent file corruption.
...
> 1. Heavy Ethernet load (nc remotehost < /dev/zero)
> 2. Heavy disk write load on any non-sata_sil drive (cat /dev/zero > /path)
> 3. Heavy disk read load on any other drive (tar c /path | cat > /dev/null)

Since it shows up under heavy load that includes unrelated devices, I
think ruling out hardware problems is important. Some suggestions:

- Use mcelog to see if you're getting any machine check exceptions
that would indicate hardware error: http://freshmeat.net/projects/mcelog/

- Use the edac module to turn on pci parity and memory error checks:
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=blob;f=Documentation/drivers/edac/edac.txt

- Run memtest86+ for several loops to make sure your RAM is ok

- Try moving the SiI card to a different slot

- Try running the SATA drives from a separate power supply

- Move disks and cables around to see whether the problem follows the
disks, the cables, or the controllers

- Try enabling the "spread spectrum" clock option in your BIOS to
reduce EMI

-jim
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/