Re: Proposal for "proper" durable fsync() and fdatasync()

From: Jamie Lokier
Date: Tue Feb 26 2008 - 02:55:46 EST


Jeff Garzik wrote:
> Jamie Lokier wrote:
> >By durable, I mean that fsync() should actually commit writes to
> >physical stable storage,
>
> Yes, it should.

Glad we agree :-)

> >I was surprised that fsync() doesn't do this already. There was a lot
> >of effort put into block I/O write barriers during 2.5, so that
> >journalling filesystems can force correct write ordering, using disk
> >flush cache commands.
> >
> >After all that effort, I was very surprised to notice that Linux 2.6.x
> >doesn't use that capability to ensure fsync() flushes the disk cache
> >onto stable storage.
>
> It's surprising you are surprised, given that this [lame] fsync behavior
> has remained consistently lame throughout Linux's history.

I was surprised because of the effort put into IDE write barriers to
get it right for in-kernel filesystems, and the messages in 2004
telling concerned users that fsync would use barriers in 2.6, which it
does sometimes but not always.

> [snip huge long proposal]
>
> Rather than invent new APIs, we should fix the existing ones to _really_
> flush data to physical media.
>
> Linux should default to SAFE data storage, and permit users to retain
> the older unsafe behavior via an option. It's completely ridiculous
> that we default to an unsafe fsync.

Well, I agree with you. Which is why the "new API" I suggested, being
really just an extension of an existing one, allows fsync() to be SAFE
if that's what people want.

To be fair, fsync() is rather overkill for some apps.
sync_file_range() is obviously the right place for fine tuning "less
safe" variations.
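
As a rough sketch of that distinction, assuming Linux with glibc: the
helper names append_durable() and flush_range_relaxed() are hypothetical,
made up for illustration. The first relies on fsync() for whatever
durability the kernel provides; the second uses sync_file_range(), which
only starts and waits for writeback of a byte range and does not by
itself flush file metadata or the disk's write cache.

```c
#define _GNU_SOURCE             /* exposes sync_file_range(), Linux-specific */
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: write all of buf to fd, then call fsync().
 * Returns 0 on success, -1 on error (errno set by the failing call).
 * How durable this is depends on whether the kernel's fsync() really
 * flushes the drive's write cache, which is the point under discussion.
 */
int append_durable(int fd, const void *buf, size_t len)
{
    const char *p = buf;

    while (len > 0) {
        ssize_t n = write(fd, p, len);
        if (n < 0)
            return -1;
        p += n;
        len -= (size_t)n;
    }
    return fsync(fd);
}

/*
 * Hypothetical helper: the "less safe" variation. Waits for any
 * writeback already in flight on the range, starts writeback of dirty
 * pages in it, and waits for that to finish. No metadata is written
 * and no disk cache flush is requested, so this alone is not a
 * durability guarantee.
 */
int flush_range_relaxed(int fd, off_t offset, off_t nbytes)
{
    return sync_file_range(fd, offset, nbytes,
                           SYNC_FILE_RANGE_WAIT_BEFORE |
                           SYNC_FILE_RANGE_WRITE |
                           SYNC_FILE_RANGE_WAIT_AFTER);
}
```

A database might call something like flush_range_relaxed() while a
transaction is in progress and save append_durable() for the commit
record.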

> And [anticipating a common response from others] it is completely
> irrelevant that POSIX fsync(2) permits Linux's current behavior. The
> current behavior is unsafe.
>
> Safety before performance -- ESPECIALLY when it comes to storing user data.

Especially now that people work a lot in guest VMs, where the IDE
barrier stuff doesn't work if the host fdatasync() doesn't work.

Since it happened with Mac OS X, I wouldn't be surprised if changing
fsync(), and only that, turned out to be unpopular. Heck, you already get
people asking "how to turn off fsync in PostgreSQL"... (Haven't those
people heard of transactions...?)

But with changes to sync_file_range() [or whatever... I don't care] to
support databases' finely tuned commit needs, and then adoption of
that by database vendors, perhaps nobody will mind fsync() becoming
safe then.

Nobody seems bothered by its performance for other things.

-- Jamie