wishful thinking about atomic, multi-sector or full MD stripe width, writes in storage

From: Ric Wheeler
Date: Thu Sep 03 2009 - 10:16:09 EST

On 09/03/2009 09:59 AM, Krzysztof Halasa wrote:
Ric Wheeler<rwheeler@xxxxxxxxxx> writes:

We (red hat) have all kinds of different raid boxes...

I have no doubt about it, but are those you know equipped with
battery-backed write-back cache? Are they using SATA disks?

We can _at_best_ compare non-battery-backed RAID using SATA disks with
what we typically have in a PC.

The whole thread above is about software MD using commodity drives (S-ATA or SAS) without a battery-backed write cache.

We have that (and I have it personally) and do test it.

You must disable the write cache on these commodity drives *if* the MD RAID level does not support barriers properly.

This will greatly reduce errors after a power loss (both in degraded state and non-degraded state), but it will not eliminate data loss entirely. You simply cannot do that with any storage device!

Note that even without MD RAID, the file system issues I/Os in units of the file system block size (normally 4096 bytes), and most commodity storage devices use a 512-byte sector size, which means that each block write has to update eight 512-byte sectors.

Drives can (and do) have multiple platters and surfaces, and it is perfectly normal for a contiguous logical range of sectors to map to physically non-contiguous sectors. Imagine a 4KB write that straddles two adjacent tracks on one platter (requiring a seek) or is mapped across two surfaces (requiring a head switch). Also, a remapped sector can require more or less a full surface seek from wherever you are to the remapped sector area of the drive.

These are all examples of how a power loss can leave a partial update of that 4KB range of sectors, even on a local (non-MD) device. Note that unlike RAID/MD, local storage has no parity on the server to detect this partial write.
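
As a rough illustration of such a partial update, here is a minimal Python sketch. The function name torn_write and the "sectors land in ascending order" model are assumptions made for the example only; a real drive is free to write the sectors in any order.

```python
SECTOR_SIZE = 512       # bytes per sector on most commodity drives
FS_BLOCK_SIZE = 4096    # typical file system block size
SECTORS_PER_BLOCK = FS_BLOCK_SIZE // SECTOR_SIZE  # 8 sectors per block


def torn_write(old: bytes, new: bytes, sectors_written: int) -> bytes:
    """Return the on-disk block if power fails after `sectors_written` sectors.

    Hypothetical model: sectors are assumed to reach the media in
    ascending order, which a real drive does not guarantee.
    """
    if len(old) != FS_BLOCK_SIZE or len(new) != FS_BLOCK_SIZE:
        raise ValueError("blocks must be exactly one file system block")
    if not 0 <= sectors_written <= SECTORS_PER_BLOCK:
        raise ValueError("sectors_written out of range")
    cut = sectors_written * SECTOR_SIZE
    # Sectors before the cut hold new data, the rest still hold old data.
    return new[:cut] + old[cut:]


old_block = b"O" * FS_BLOCK_SIZE
new_block = b"N" * FS_BLOCK_SIZE

# Power is lost after 3 of the 8 sectors were updated.
on_disk = torn_write(old_block, new_block, sectors_written=3)

# The block on disk is now neither the old nor the new version.
print(on_disk != old_block and on_disk != new_block)
```

Only sectors_written values of 0 or 8 leave a consistent block; every value in between leaves a mix of old and new data.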

This is why new file systems like btrfs and zfs do checksumming of data and metadata. This won't prevent partial updates during a write, but can at least detect them and try to do some kind of recovery.
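
The detection idea can be sketched in a few lines of Python. This is only an illustration: zlib.crc32 is a stand-in checksum and not a claim about which algorithm either file system uses, the helper names are made up for the example, and the checksum is assumed to be stored apart from the block and to have survived intact.

```python
import zlib

SECTOR_SIZE = 512
FS_BLOCK_SIZE = 4096


def checksum(block: bytes) -> int:
    """Stand-in checksum for the example (CRC-32 from zlib)."""
    return zlib.crc32(block)


def is_intact(block: bytes, stored_checksum: int) -> bool:
    """Compare the block read back against the checksum recorded for it."""
    return checksum(block) == stored_checksum


old_block = b"O" * FS_BLOCK_SIZE
new_block = b"N" * FS_BLOCK_SIZE

# Assumption: the checksum of the new data was recorded elsewhere.
stored = checksum(new_block)

# Power loss after 3 sectors: new data up front, old data behind it.
cut = 3 * SECTOR_SIZE
torn_block = new_block[:cut] + old_block[cut:]

print(is_intact(new_block, stored))   # complete write verifies
print(is_intact(torn_block, stored))  # partial write is detected
```

The check says nothing about how to recover; it only tells the file system that the block it read is not the block it meant to write.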

In other words, this is not just an MD issue, it is entirely possible even with non-MD devices.

Also, when you enable the write cache (MD or not), you are buffering multiple MBs of data that can go away on power loss. That is a far greater exposure (10x) than the one the partial RAID rewrite case worries about.
