Re: [patch] ext2/3: document conditions when reliable operation is possible

From: Ric Wheeler
Date: Mon Aug 24 2009 - 17:09:23 EST


Pavel Machek wrote:
Hi!

Yep, and at that point you lost data. You had "silent data corruption"
from fs point of view, and that's bad.

It will be probably very bad on XFS, probably okay on Ext3, and
certainly okay on Ext2: you do filesystem check, and you should be
able to repair any damage. So yes, physical journaling is good, but
fsck is better.
I don't see why you think that. In general, fsck (for any fs) only checks metadata. If you have silent data corruption that corrupts things that are fixable by fsck, you most likely also have silent corruption hitting things users care about, like the data blocks inside their files. Fsck will not fix (or notice) any of that; that is where things like full data checksums can help.

Ok, but in case of data corruption, at least your filesystem does not
degrade further.

Even worse, your data is potentially gone and you have not noticed it... This is why array vendors and archival storage products do periodic scans of all stored data (read all the bytes, compare them to a digital signature, etc).
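
As a rough illustration of the kind of periodic scan described above, here is a minimal user-space sketch: record a digest for every file once, then re-read all the bytes later and compare. The function names (file_digest, build_manifest, scrub) and the choice of SHA-256 are assumptions made for this example, not how any particular array vendor does it. Note that a scan like this cannot tell a legitimate modification from corruption, so it only fits data that is not supposed to change.

```python
import hashlib
import os


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def build_manifest(root):
    """Record a digest for every regular file found under root."""
    manifest = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            manifest[os.path.relpath(path, root)] = file_digest(path)
    return manifest


def scrub(root, manifest):
    """Re-read every recorded file; return the paths that are missing,
    unreadable, or whose contents no longer match the recorded digest."""
    bad = []
    for rel, expected in sorted(manifest.items()):
        try:
            actual = file_digest(os.path.join(root, rel))
        except OSError:
            bad.append(rel)
            continue
        if actual != expected:
            bad.append(rel)
    return bad
```

The manifest itself would have to live on different storage than the data it describes, otherwise the same failure can take out both.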
If those filesystem assumptions were not documented, I'd call it
filesystem bug. So better document them ;-).
I think that we need to help people understand the full spectrum of data concerns, starting with reasonable best practices that will help most people suffer *less* (not no) data loss. And make very sure that they are not falsely assured that, by following any specific script, they can skip backups, remote backups, etc :-)

Nothing in our code in any part of the kernel deals well with every disaster or odd event.

I can reproduce data loss with ext3 on flashcard in about 40
seconds. I'd not call that "odd event". It would be nice to handle
that, but that is hard. So ... can we at least get that documented
please?

Part of documenting best practices is to put down very specific things that do/don't work. What I worry about is producing too much detail to be of use for real end users.

I have to admit that I have not paid enough attention to the specifics of your ext3 + flash card issue - is it the FTL stuff doing out-of-order I/Os?

Actually, ext2 should be able to survive that, no? Error writing ->
remount ro -> fsck on next boot -> drive relocates the sectors.
I think that the example and the response are both off base. If your head ever touches the platter, you won't be reading from a huge part of your drive ever again (usually, you have 2 heads per platter, 3-4 platters, impact would kill one head and a corresponding percentage of your data).

Ok, that's obviously game over.

This is when you start seeing lots of READ and WRITE errors :-)
It's for this reason that I've never been completely sure how useful
Pavel's proposed treatise about file system expectations really is
--- because all storage subsystems *usually* provide these guarantees,
but it is the very rare storage system that *always* provides these
guarantees.
Well... there's very big difference between harddrives and flash
memory. Harddrives usually work, and flash memory never does.
It is hard for anyone to see the real data without looking in detail at large numbers of parts. Back at EMC, we looked at failures for lots of parts so we got a clear grasp on trends. I do agree that flash/SSD parts are still very young so we will have interesting and unexpected failure modes to learn to deal with....

_Maybe_ SSDs, being HDD replacements are better. I don't know.

_All_ flash cards (MMC, USB, SD) had the problems. You don't need to
get clear grasp on trends. Those cards just don't meet ext3
expectations, and if you pull them, you get data loss.

Pull them even after an unmount, or pull them hot?
We could just as easily have several kilobytes of explanation in
Documentation/* explaining how we assume that DRAM always returns the
same value that was stored in it previously --- and yet most PC class
hardware still does not use ECC memory, and cosmic rays are a reality.
That means that most Linux systems run on systems that are vulnerable
to this kind of failure --- and the world hasn't ended.

There's a difference. In case of cosmic rays, hardware is clearly
buggy. I have one machine with bad DRAM (about 1 error in 2 days),
and I still use it. I will not complain if ext3 trashes that.

In case of degraded raid-5, even with perfect hardware, and with
ext3 on top of that, you'll get silent data corruption. Nice, eh?
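
The mechanism usually cited for this claim is the RAID-5 "write hole": on a degraded array, a block on the missing disk exists only as the XOR of the surviving blocks and parity, so a crash between a data write and the matching parity write changes a block that nobody wrote to. The following is a toy model of one stripe under that assumption. It is not MD's implementation, and arrays that protect the stripe update (for example with a non-volatile write cache) do not behave this way. The names xor_blocks and degraded_write_hole are made up for the example.

```python
def xor_blocks(*blocks):
    """Byte-wise XOR of equal-length blocks."""
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, b in enumerate(block):
            out[i] ^= b
    return bytes(out)


def degraded_write_hole():
    """Return (original d0, d0 as reconstructed after an interrupted write)."""
    # One stripe: two data blocks plus parity.
    d0 = b"AAAA"                      # block on the disk that has failed
    d1 = b"BBBB"                      # block on a surviving disk
    parity = xor_blocks(d0, d1)

    # While degraded, d0 is only available as d1 XOR parity.
    assert xor_blocks(d1, parity) == d0

    # The filesystem rewrites d1. The new data reaches the disk, but
    # power is lost before the parity block is updated.
    d1_on_disk = b"CCCC"
    parity_on_disk = parity           # stale parity

    # After reboot, d0 is reconstructed from the inconsistent stripe.
    recovered_d0 = xor_blocks(d1_on_disk, parity_on_disk)
    return d0, recovered_d0


original, recovered = degraded_write_hole()
print(original, recovered)            # b'AAAA' b'@@@@'
```

The block that changed (d0) was never written by the filesystem, which is why journaling on top cannot detect or replay it.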

Clearly, Linux is buggy there. It could be argued it is raid-5's
fault, or maybe it is ext3's fault, but... linux is still buggy.
Nothing is perfect. It is still a trade off between storage utilization (how much storage we give users for say 5 2TB drives), performance and costs (throw away any disks over 2 years old?).

"Nothing is perfect"?! That's design decision/problem in raid5/ext3. I
believe that should be at least documented. (And understand why ZFS is
interesting thing).

Your statement is overly broad - ext3 on a commercial RAID array that does RAID5 or RAID6, etc has no issues that I know of.

Do you know first hand that ZFS works on flash cards?
Ext3 is unsuitable for flash cards and RAID arrays, plain and
simple. It is not documented anywhere :-(. [ext2 should work better --
at least you'll not get silent data corruption.]
ext3 is used on lots of raid arrays without any issue.

And I still use my zaurus with crappy DRAM.

I would not trust raid5 array with my data, for multiple
reasons. The fact that degraded raid5 breaks ext3 assumptions should
really be documented.

Again, you say RAID5 without enough specifics. Are you pointing just at MD RAID5 on S-ATA? Hardware RAID cards? A specific commercial RAID5 vendor?
I hold ext2/ext3 to higher standards than other filesystem in
tree. I'd not use XFS/VFAT etc.

I would not want people to migrate towards XFS/VFAT, and yes I believe
XFSs/VFATs/... requirements should be documented, too. (But I know too
little about those filesystems).

If you can suggest better wording, please help me. But... those
requirements are non-trivial, commonly not met and the result is data
loss. It has to be documented somehow. Make it as innocent-looking as
you can...

I think that you really need to step back and look harder at real failures - not just your personal experience - but a larger set of real world failures. Many papers have been published recently about that (the Google paper, the Bianca paper from FAST, NetApp, etc).

The papers show failures in "once a year" range. I have "twice a
minute" failure scenario with flashdisks.

Not sure how often "degraded raid5 breaks ext3 atomicity" would bite,
but I bet it would be on "once a day" scale.

We should document those.
Pavel

Documentation is fine with sufficient, hard data....

ric

