Re: Data integrity built into the storage stack

From: Martin K. Petersen
Date: Tue Sep 01 2009 - 01:21:01 EST


>>>>> "Greg" == Greg Freemyer <greg.freemyer@xxxxxxxxx> writes:

Greg> We already have the scsi data integrity patches that went in last
Greg> winter and I believe fit into the storage stack below the
Greg> filesystem layer.

The filesystems can actually use it. It's exposed at the bio level.


Greg> I do believe there is a patch floating around for device mapper to
Greg> add some integrity capability.

The patch is in mainline. It allows passthrough so the filesystems can
access the integrity features. But DM itself doesn't use any of them,
it merely acts as a conduit.

DIF is inherently tied to the storage device's logical blocks. These are
likely to be smaller than the blocks we're interested in protecting.
However, you could conceivably use the application tag space to add a
checksum with filesystem or MD/DM blocking size granularity. All the
hooks are there.
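
To make the idea concrete, here is a minimal Python sketch, not kernel
code. It assumes 512-byte logical blocks, a 4KB filesystem block, and
the standard 8-byte DIF tuple (16-bit guard CRC, 16-bit application tag,
32-bit reference tag). The helper names and the choice of a 16-byte
BLAKE2b digest as the block checksum are hypothetical, picked only for
illustration: eight sectors with two bytes of application tag each give
16 bytes of room per 4KB block.

```python
import hashlib
import struct

SECTOR = 512          # assumed logical block size
FS_BLOCK = 4096       # assumed filesystem block size
SECTORS_PER_BLOCK = FS_BLOCK // SECTOR   # 8 sectors, 2 bytes of app tag each


def crc16_t10dif(data):
    """Bitwise CRC-16 with the T10 DIF polynomial 0x8BB7 (init 0, MSB first)."""
    crc = 0
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            if crc & 0x8000:
                crc = ((crc << 1) ^ 0x8BB7) & 0xFFFF
            else:
                crc = (crc << 1) & 0xFFFF
    return crc


def block_checksum(block):
    """Hypothetical 16-byte checksum over the whole filesystem block."""
    return hashlib.blake2b(block, digest_size=2 * SECTORS_PER_BLOCK).digest()


def make_tuples(block, first_lba):
    """Build one 8-byte DIF tuple per sector, storing a 2-byte slice of the
    block checksum in each sector's application tag."""
    assert len(block) == FS_BLOCK
    digest = block_checksum(block)
    tuples = []
    for i in range(SECTORS_PER_BLOCK):
        sector = block[i * SECTOR:(i + 1) * SECTOR]
        guard = crc16_t10dif(sector)                           # per-sector guard tag
        app = int.from_bytes(digest[2 * i:2 * i + 2], "big")   # slice of block checksum
        ref = (first_lba + i) & 0xFFFFFFFF                     # low 32 bits of the LBA
        tuples.append(struct.pack(">HHI", guard, app, ref))
    return tuples


def verify_block(block, tuples):
    """Reassemble the checksum from the application tags and compare."""
    stored = b"".join(t[2:4] for t in tuples)
    return stored == block_checksum(block)
```

A real implementation would also have to avoid an application tag of
0xFFFF, which Type 1 and Type 2 protection treat as an escape that
disables checking for that sector.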

The application tag space is pretty much only available on disk
drives--array vendors use it for internal purposes. But in the MD/DM
case we're likely to run on raw disk so that's probably ok.

That said, I really think btrfs is the right answer to many of the
concerns raised in this thread. Everything is properly checksummed and
can be verified at read time.

The strength of DIX/DIF is that we can detect corruption at write time,
while the buffer we care about is still sitting in memory.

So btrfs and DIX/DIF go hand in hand as far as I'm concerned. They
solve different problems but both are squarely aimed at preventing
silent data corruption.
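
The difference between a write-time and a read-time check can be
sketched in a few lines of Python. This is a toy model under stated
assumptions, not how the block layer or any HBA is implemented:
`ToyDisk`, `submit_write` and the use of CRC-32 in place of the real
guard tag are all hypothetical. The point is only where the check
happens. A mismatch at write time rejects the I/O while the submitter
still holds the good buffer, whereas a read-time checksum can only
report the damage later.

```python
import zlib


class ToyDisk:
    """Hypothetical device that verifies protection data before committing."""

    def __init__(self):
        self.sectors = {}

    def write(self, lba, data, guard):
        # Write-time check: refuse to commit data that does not match its guard.
        if zlib.crc32(data) != guard:
            return False
        self.sectors[lba] = (data, guard)
        return True

    def read(self, lba):
        # Read-time check: all we can do here is detect the problem.
        data, guard = self.sectors[lba]
        if zlib.crc32(data) != guard:
            raise IOError("checksum mismatch on read")
        return data


def submit_write(disk, lba, buf, corrupt_in_flight=False):
    """Generate the guard from the caller's buffer, then send the data down.
    Returns the number of attempts needed."""
    guard = zlib.crc32(buf)            # computed while the buffer is known good
    attempts = 0
    while True:
        attempts += 1
        payload = bytearray(buf)
        if corrupt_in_flight and attempts == 1:
            payload[0] ^= 0xFF         # simulate a bit flip between host and disk
        if disk.write(lba, bytes(payload), guard):
            return attempts
        # Rejected at write time: buf is still in memory, so simply retry.
```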

I do agree that we have to be more prepared for collateral damage
scenarios. As we discussed at LS, we have 4KB drives coming out that can
invalidate previously acknowledged I/Os if they get a subsequent write
failure on a sector. And there's also the issue of fractured writes
when talking to disk arrays. That's really what my I/O topology changes
were all about: Correctness. The fact that they may increase
performance is nice but that was not the main motivator.
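
As a rough illustration of the correctness angle, the sketch below
checks whether a write covers only whole physical blocks, so that the
drive never has to rewrite neighbouring logical sectors that were
already acknowledged. It assumes 512-byte logical blocks on a 4KB
physical block, and the helper names are hypothetical. The
`alignment_offset` parameter is interpreted here as the byte offset of
the first naturally aligned boundary, which is an assumption made for
this example.

```python
LOGICAL = 512     # assumed logical block size in bytes
PHYSICAL = 4096   # assumed physical block size in bytes


def covers_whole_physical_blocks(start_lba, nr_sectors, alignment_offset=0):
    """True if the write starts and ends on physical block boundaries."""
    start = start_lba * LOGICAL - alignment_offset
    length = nr_sectors * LOGICAL
    return length > 0 and start % PHYSICAL == 0 and length % PHYSICAL == 0


def physical_blocks_touched(start_lba, nr_sectors, alignment_offset=0):
    """Indices of the physical blocks the drive must rewrite for this I/O."""
    start = start_lba * LOGICAL - alignment_offset
    end = start + nr_sectors * LOGICAL - 1
    return list(range(start // PHYSICAL, end // PHYSICAL + 1))
```

A single 512-byte write to LBA 1 touches physical block 0 without
covering it, so a failure there can take the seven neighbouring logical
sectors with it.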

--
Martin K. Petersen Oracle Linux Engineering