Re: Data integrity built into the storage stack [was: Re: [testcase] test your fs/storage stack (was Re: [patch] ext2/3: document conditions when reliable operation is possible)]

From: Rob Landley
Date: Sat Aug 29 2009 - 20:35:32 EST

On Saturday 29 August 2009 16:23:50 Greg Freemyer wrote:
> I've read a fair amount of the various threads discussing sector /
> data corruption of flash and raid devices, but by no means all.
> It seems to me the key thing Pavel is highlighting is that many
> storage devices / arrays have reasonably common failure modes where
> data corruption is silently introduced to stable data. I have seen it
> mentioned, but the scsi spec. recently got a "data integrity" option
> that would allow these corruptions to at least be detected if the
> option were in use.
> Regardless, administrators have known forever that a bad cable / bad
> ram / bad controller / etc. can cause data written to a hard drive to
> be written with corrupted values that will not cause a media error on
> read.

Bad ram can do anything if you don't have ECC memory, sure. In my admittedly
limited experience, bad controllers tend not to fail _quietly_, with a problem
writing just this one sector.

I personally have had a tropism for software raid because over the years I've
seen more than one instance of data loss when a proprietary hardware raid card
went bad after years of service and the company couldn't find a sufficiently
similar replacement for the obsolete part capable of reading that strange
proprietary disk format the card was using. (Dell was notorious for this once
upon a time. These days it's a bit more standardized, but I still want to be
sure that I _can_ get the data off from a straight passthrough arrangement
before being _happy_ about it.)

As for bad cables, I believe ATA/33 and higher have checksummed the data going
across the cable for most of a decade now, at least for DMA transfers. (Don't
ask me about scsi, I mostly didn't use it.)

You've got a 1 in 4 billion chance of a 32 bit checksum magically working out
even with corrupted data, of course, but that's a fluke failure and not a
design problem. And if it got enough failures it would downshift the speed or
drop to PIO mode and it was possible to detect that your hardware was flaky.
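
To put a number on that, here is a minimal sketch in Python. It takes the 32 bit width quoted above at face value (the real width depends on the interface) and uses zlib's CRC-32 purely as a stand-in; it is not the polynomial any particular ATA or SATA link uses. The helper names are made up for illustration.

```python
import zlib

def undetected_odds(bits):
    """Random corruption slips past a `bits`-wide checksum about 1 time in this many."""
    return 2 ** bits

def transfer_ok(payload, sent_crc):
    """Receiver-side check: recompute the checksum and compare with what was sent."""
    return zlib.crc32(payload) == sent_crc

# A 512-byte stand-in for one sector, plus the checksum sent alongside it.
sector = bytes(range(256)) * 2
crc = zlib.crc32(sector)

# Simulate a flaky cable flipping a single bit in flight.
damaged = bytearray(sector)
damaged[100] ^= 0x01

print(undetected_odds(32))              # 4294967296, i.e. roughly 4 billion
print(transfer_ok(sector, crc))         # clean transfer passes
print(transfer_ok(bytes(damaged), crc)) # single-bit damage is caught
```

A CRC catches every single-bit error by construction, so the 1 in 2^32 figure only applies to damage that scrambles the payload more or less at random.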

These days I'm pretty sure SATA and USB2 are both checksumming the data going
across the cable, because the PHY transceivers those use are descended from
the PHY transceivers originally developed for gigabit ethernet.

PC hardware has always been exactly as cheap and crappy as it could get away
with, but that's a lot less crappy at gigabit speeds and terabyte sizes than
it was in the 16 bit ISA days. We'd be overwhelmed with all the failures
otherwise. (Note that the crappiness of USB flash keys is actually
_surprising_ to some of us, the severity and ease of triggering these failure
modes are beyond what we've come to expect.)

> It seems to me a file system neutral document describing "silent
> corruption of stable data on permanent storage medium" would be
> appropriate. Then the linux kernel can start to be hardened to
> properly respond to situations where the data read is not the data
> written.

According to some of the cool things about
btrfs are:

A) everything is checksummed, because inodes and dentries and data extents are
all just slightly differently tagged entries in one big tree, and every entry
in the tree is checksummed.

2) It has backreferences so you can find most entries from more than one place
if you have to rebuild a damaged tree.

III) The tree update algorithms are lockless so you can potentially run an
fsck/defrag on the sucker in the background, which can among other things re-
read the old data and make sure the checksums match. (So your recommended
regular fsck can in fact be a low priority cron job.)

That wouldn't prevent you from losing data to this sort of corruption (nothing
would), but it does give you potentially better ways to find and deal with it.
Heck, just looking at the stderr output of a simple:

find / -type f -print0 | xargs -0 -n 1 cat > /dev/null

Could potentially tell you something useful if the filesystem is giving you
read errors when the extent checksums don't match. And an rsync could
reliably get read errors (and abort the backup) due to checksum mismatch
instead of copying spans of zeroes over your archival copy.
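
For anyone who wants a list of the affected files rather than a wall of stderr, the same idea can be sketched in Python. This is only an illustration: find_unreadable is a made-up name, and it assumes the filesystem surfaces a checksum mismatch as an ordinary read error (an OSError here), which is the "if" in the paragraph above.

```python
import os

def find_unreadable(root, chunk_size=1 << 20):
    """Read every regular file under root; return (path, error) for each failed read."""
    failures = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Mirror "find -type f": skip symlinks, sockets, devices and the like.
            if os.path.islink(path) or not os.path.isfile(path):
                continue
            try:
                with open(path, "rb") as f:
                    # Read and discard the data, like "cat > /dev/null".
                    while f.read(chunk_size):
                        pass
            except OSError as err:
                failures.append((path, str(err)))
    return failures
```

Calling find_unreadable("/some/mount/point") and printing the result gives one line per file that could not be read back. Permission errors show up in the same list, so it wants to run as a user that can read everything, and it is better pointed at a data filesystem than at / with /proc and /sys underneath it.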

So far this thread reads to _me_ as an implicit endorsement of btrfs. But any
suggestion that one filesystem might handle this problem better than another
has so far been taken as a personal attack against people's babies, so I
haven't asked...

Latency is more important than throughput. It's that simple. - Linus Torvalds