Re: writing file to disk: not as easy as it looks

From: Pavel Machek
Date: Tue Dec 02 2008 - 17:42:32 EST

Next message: Dave Hansen: "Re: [PATCH 2/6] integrity: Linux Integrity Module(LIM)"
Previous message: Rafael J. Wysocki: "Re: Regression from 2.6.26: Hibernation (possibly suspend) broken on Toshiba R500 (bisected)"
In reply to: Theodore Tso: "Re: writing file to disk: not as easy as it looks"
Next in thread: Pavel Machek: "Re: writing file to disk: not as easy as it looks"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

On Tue 2008-12-02 15:55:58, Theodore Tso wrote:
> On Tue, Dec 02, 2008 at 11:22:58AM -0600, Chris Friesen wrote:
> > Theodore Tso wrote:
> >
> >> Even for ext3/ext4 which is doing physical journalling, it's still the
> >> case that the journal commits first, and it's only later when the
> >> write happens that we write out the change. If the disk fails some of
> >> the writes, it's possible to lose data, especially if the two blocks
> >> involved in the node split are far apart, and the write to the
> >> existing old btree block fails.
> >
> > Yikes. I was under the impression that once the journal hit the platter
> > then the data were safe (barring media corruption).
>
> Well, this is a case of media corruption (or a cosmic ray hitting
> hitting a ribbon cable in the disk controller sending the write to the
> wrong location on disk, or someone bumping the server causing the disk
> head to lift up a little higher than normal while it was writing the
> disk sector, etc.). But it is a case of the hard drive misbehaving.

I could not parse this. Negation seems to be missing somewhere.

> Heck, if you have a hiccup while writing an inode table block out to
> disk (for example a power failure at just the wrong time), so the
> memory (which is more voltage sensitive than hard drives) DMA's
> garbage which gets written to the inode table, you could lose a large
> number of adjacent inodes when garbage gets splatted over the inode
> table.

Ok, "memory failed before disk" is ... bad hardware.

...but... you seem to be saying that modern filesystems can damage
data even on "sane" hardware.

Lets define sane as:

1) if disk says sector was successfully written, it is so, until you
start writing to that sector again.

(but disk may say "error writing". Filesystem should propagate
that back to the userland, reliably. "Error writing" is
extremely rare on modern disks, but can happen if you run out
of spare blocks.)

(and if you ask for sector write, sector is in undefined
state until drive returns success. Flashes behave like this
-- reads return errors. Do disks?)

2) connection to the disk either works or fails totally. Bit errors
are reliably detected at connection level.

3) power may fail any time.

You seem to be saying that ext2/ext3 only work if these are met:

1) power may fail any time.

2) writes are always successful.

3) connection to the disk always works.

AFAICT it is unsafe to run ext2/ext3 on any media that can be removed
without unmounting (missing fsync error propagation), and it is unsafe
to run ext2/ext3 on any flash-based storage with block interface (SD
cards, flash sticks).

> Ext3 tends to recover from this better than other filesystems, thanks
> to the fact that it does physical block journalling, but you do pay
> for this in terms of performance if you have a metadata-intensive
> workload, because you're writing more bytes to the journal for each
> metadata opeation.

And thanks for that! Actually I'd be willing to pay some more
performance to get reliability up.

> > It seems like the more I learn about filesystems, the more failure modes
> > there are and the fewer guarantees can be made. It's amazing that
> > things work as well as they do...
>
> There are certainly things you can do. Put your fileservers's on
> UPS's. Use RAID. Make backups. Do all three. :-)

I was almost stupid enough to move primary copy of ~ and linux trees
to SD... I do have UPSes, unfortunately they are li-ion and i'm
running off them most of the time. I do have backups, but restoring
them all the time is boring & time consuming. I'll try to stick two
MMC cards into SD slot to make it RAID 1 :-).

Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

Next message: Dave Hansen: "Re: [PATCH 2/6] integrity: Linux Integrity Module(LIM)"
Previous message: Rafael J. Wysocki: "Re: Regression from 2.6.26: Hibernation (possibly suspend) broken on Toshiba R500 (bisected)"
In reply to: Theodore Tso: "Re: writing file to disk: not as easy as it looks"
Next in thread: Pavel Machek: "Re: writing file to disk: not as easy as it looks"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]