Re: document ext3 requirements

From: Martin K. Petersen
Date: Mon Jan 05 2009 - 14:17:35 EST


>>>>> "Pavel" == Pavel Machek <pavel@xxxxxxx> writes:

>> It is mostly true on SCSI class devices because various UNIX, RAID
>> array and database vendors have spent many years leaning very hard on
>> the drive manufacturers to make it so.
>>
>> But it's not a hard guarantee, you can't get it in writing, and it's
>> not in any of the standards. Hybrid drives with flash had potential
>> to close that particular loophole but those appear to be dead in the
>> water.

Pavel> So "in practice it works but vendors will not guarantee that"?

It works some of the time. But in reality if you yank power halfway
during a write operation the end result is undefined.

The saving grace for normal users is that the potential corruption is
limited to a couple of sectors.

The current suck of flash SSDs is that the erase block size amplifies
this problem by at least one order of magnitude, often two. I have a
couple of SSDs here that will leave my filesystem in shambles every time
the machine crashes. I quickly got tired of reinstalling Fedora several
times per week so now my main machine is back to spinning media.

The people that truly and deeply care about this type of write atomicity
(i.e. enterprises) deploy disk arrays that will do the right thing in
face of an error. This involves NVRAM, mirrored caches, uninterruptible
power supplies, etc. Brute force if you will.

High-end arrays even give you atomicity at a bigger granularity such as
filesystem or database blocks. On some storage you can say "this LUN is
used for an Oracle database that always writes in multiples of 8KB" and
the array will guarantee that each 8KB block of the I/O is written in
its entirety or not at all. Some arrays even allow you to verify Oracle
logical block checksums to ensure that the I/O is intact and internally
consistent.

I have been bugging storage vendors about a per-I/O write atomicity
setting for a while. But it really messes up their pipelining so they
aren't keen on the idea. We may be able to get some of it fixed as a
side-effect of the DIF bits vs. the impending switch to 4KB sectors,
though.

--
Martin K. Petersen Oracle Linux Engineering
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/