Re: writing file to disk: not as easy as it looks

From: Pavel Machek
Date: Wed Dec 03 2008 - 03:47:19 EST

On Wed 2008-12-03 00:07:09, Theodore Tso wrote:
> On Tue, Dec 02, 2008 at 11:44:03PM +0100, Pavel Machek wrote:
> > > >
> > > > Yikes. I was under the impression that once the journal hit the platter
> > > > then the data were safe (barring media corruption).
> > >
> > > Well, this is a case of media corruption (or a cosmic ray hitting
> > > a ribbon cable in the disk controller sending the write to the
> > > wrong location on disk, or someone bumping the server causing the disk
> > > head to lift up a little higher than normal while it was writing the
> > > disk sector, etc.). But it is a case of the hard drive misbehaving.
> >
> > I could not parse this. Negation seems to be missing somewhere.
> I was agreeing with your original statement. Once the journal hits
> the platter, the data is safe, barring hard drive malfunctions (not
> just media corruption). I was just listing the many other types of
> hard drive failures that could cause data loss.

Aha, ok, sorry for confusion.

> > Ok, "memory failed before disk" is ... bad hardware.
> It's PC class hardware. Live with it. Back when SGI made their own
> hardware, they noticed this problem, and so they wired up their SGI
> machines with powerfail interrupts, and extra big capacitors in their
> power supplies, and when Irix got a powerfail interrupt, it would
> frantically run around aborting DMA transfers to avoid this particular
> problem. At least, that's what an old-timer SGI engineer (who is
> unfortunately no longer at SGI) told me.
>
> PC class hardware doesn't have power fail interrupts. Hence, my advice
> to you is that if you use a filesystem that does logical journalling
> --- better have a UPS.

Hmm, 'just avoid logical journalling' seems like a better solution.

> > ...but... you seem to be saying that modern filesystems can damage
> > data even on "sane" hardware.
> The example I gave was one where a disk failure could cause a file
> that had previously been successfully written to disk and fsync()'ed to
> be damaged by another filesystem operation ***in the face of hard
> drive failure***. Surely that is obvious. The most obvious case of


> The example I gave, where a b-tree is doing a split and a failure
> writing to the b-tree causes ancillary damage to files referenced in
> the b-tree node being split, can happen with **any**
> filesystem. The only thing that will save you here would be a
> copy-on-write type filesystem, such as WAFL or btrfs.

ext3-like physical journaling could be extended to handle write
failures (at speed penalty), no?

Write 'I will rewrite block A containing B with C' into journal... ok,
I guess I should wait for btrfs.
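
To make the idea concrete, here is a toy sketch of a journal that records the old contents of a block before it is overwritten in place. Everything in it (ToyDisk, journaled_write, recover, the record layout) is hypothetical and made up for illustration; it is not how ext3 works. It assumes the journal record reaches stable storage before the in-place write, and that the rollback write itself succeeds, which a really bad sector would not guarantee.

```python
class WriteError(Exception):
    """Raised by the toy disk when a block write fails."""


class ToyDisk:
    """In-memory block device that can be told to fail its next write."""

    def __init__(self, nblocks):
        self.blocks = [b""] * nblocks
        self.fail_next = False

    def write(self, n, data):
        if self.fail_next:
            self.fail_next = False
            # Model a write that damages the block before reporting failure.
            self.blocks[n] = b"<garbage>"
            raise WriteError(n)
        self.blocks[n] = data


def journaled_write(disk, journal, n, new):
    """Log 'block n containing old will become new', then write in place."""
    record = {"block": n, "old": disk.blocks[n], "new": new, "state": "pending"}
    journal.append(record)  # assumed to be on stable storage at this point
    disk.write(n, new)
    record["state"] = "committed"


def recover(disk, journal):
    """Restore the old contents of every block whose write never completed."""
    for record in reversed(journal):
        if record["state"] == "pending":
            disk.write(record["block"], record["old"])
            record["state"] = "rolled_back"
```

The speed penalty is visible even in the toy: every update has to read the old block and write it to the journal along with the new one, so each in-place write costs roughly twice the journal traffic.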

> > You seem to be saying that ext2/ext3 only work if these are met:
> >
> > 1) power may fail any time.
> Well, ext2/ext3 will work fine if the power is always reliable, too. :-)

:-) ok.

> > 2) writes are always successful.
> To the extent that write failures while writing filesystem metadata
> can, if you are unlucky, be catastrophic, yeah. Fortunately, such
> write failures are normally fairly rare, but if you worry about such
> things, RAID is the answer. As I said, I believe this is going to be
> true for pretty much any update-in-place filesystem. It's always
> possible to construct failure scenarios if the hardware is unreliable.


> > 3) connection to the disk always works.
> >
> > AFAICT it is unsafe to run ext2/ext3 on any media that can be removed
> > without unmounting (missing fsync error propagation), and it is unsafe
> > to run ext2/ext3 on any flash-based storage with block interface (SD
> > cards, flash sticks).
> The data on the disk before the connection is yanked should be safe
> (although as we mentioned in another thread, the flash drive itself
> may not be happy if you are writing to the Flash Translation Layer at
> the time when power is cut; if that causes a previously written sector
> to disappear, that's an example of a hardware failure that **any**
> filesystem won't necessarily be able to recover from).
>
> Your definition of "safe" seems to include worrying about making sure
> that all processes that may have previously touched a file or a
> directory get an error when they try to do an fsync() on that file or
> directory, and that given that fsync() clears the error condition after
> it returns it, it is therefore "unsafe".

Yes. fsync() seems surprisingly high on Rusty's list of broken
interfaces classification ('impossible to use correctly').
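
For reference, the usual careful-write sequence can be sketched as below. This is a minimal sketch, assuming Linux/POSIX semantics; durable_write is a hypothetical helper name, not an existing API. It writes to a temporary file, calls fsync() once and treats any failure as final (a retry cannot be trusted if the error has already been cleared), renames over the target, then fsync()s the directory. It does nothing for other processes that hold the same file open, which is exactly the gap being discussed.

```python
import os
import tempfile


def durable_write(path, data):
    """Replace path with data via a temporary file; any error is fatal."""
    directory = os.path.dirname(os.path.abspath(path))
    # Create the temporary file in the target directory so that the
    # rename stays within one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        try:
            view = memoryview(data)
            while view:
                # os.write() may write fewer bytes than requested.
                written = os.write(fd, view)
                view = view[written:]
            # Flush file contents; an OSError here means the data may be
            # lost, and a second fsync() might not report it again.
            os.fsync(fd)
        finally:
            os.close(fd)
        # Atomically replace the destination with the fully written file.
        os.replace(tmp_path, path)
    except BaseException:
        # Leave the old file untouched and remove the temporary one.
        try:
            os.unlink(tmp_path)
        except FileNotFoundError:
            pass
        raise
    # Flush the directory so the rename itself survives a crash.
    dir_fd = os.open(directory, os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
```

Even this much only covers the process doing the write; it cannot tell whether an earlier error on the same file was already reported to, and cleared by, somebody else's fsync().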

I wonder if some reasonable solution exists? Marking the filesystem as
failed on the first write error is one candidate (and the default for
ext2/3?). Did SGI/big unixen solve this somehow?
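
The difference between the two behaviours can be shown with a toy model. This is only a sketch of the semantics as described in this thread, not of any kernel code; ClearingErrorFile, StickyErrorFilesystem and fsync_reports_error are hypothetical names invented for the illustration.

```python
import errno


class ClearingErrorFile:
    """Toy model: a write-back error is reported by one fsync(), then cleared."""

    def __init__(self):
        self.pending_error = None

    def writeback_failed(self):
        self.pending_error = errno.EIO

    def fsync(self):
        # Report the pending error once and forget it.
        err, self.pending_error = self.pending_error, None
        if err is not None:
            raise OSError(err, "write-back failed")


class StickyErrorFilesystem:
    """Toy model: the first write error marks the whole filesystem failed."""

    def __init__(self):
        self.failed = False

    def writeback_failed(self):
        self.failed = True

    def fsync(self):
        # Every caller sees the failure until the filesystem is repaired.
        if self.failed:
            raise OSError(errno.EIO, "filesystem marked failed")


def fsync_reports_error(target):
    """Return True if target.fsync() raises OSError."""
    try:
        target.fsync()
    except OSError:
        return True
    return False
```

With the clearing model, whichever process calls fsync() first gets the error and a second caller is told everything is fine; with the sticky model every caller sees the failure, at the cost of taking the whole filesystem out of service.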

> The reality is that most applications don't do proper error checking,
> and even fewer actually call fsync(), so if you are putting your root
> filesystem on a 32G flash card, and it pops out easily due to hardware
> design issues, the question of whether fsync() errors get properly
> propagated to all potentially interested applications is the
> ***least*** of your worries.

Yes, most applications are bad. Yes, I should just glue the card into
the slot. No, the fsync interface does not look properly designed. No,
it is not causing me immediate problems (mount -o dirsync mostly works
around that). I wonder if a good, long-term solution exists...
