[PATCH] notes on volatile write caches vs fdatasync

From: Christoph Hellwig
Date: Wed Aug 26 2009 - 21:17:11 EST

There are two related issues when dealing with volatile write caches.
The popular and beaten-to-death one is write barriers, which guarantee
write ordering and stable storage for log writes. For this post
I naively assume this works perfectly for all filesystems supporting it.

The second issue is plain cache flushes. Yes, they happen to be the
base for the barrier implementation on all common disks in Linux, but
there are cases where we need to issue them even without a log barrier.

Think about a plain write into a file that is already fully allocated,
or the O_DIRECT version of the same. If we do an fdatasync after these
we do expect the write to really be on disk, not just in the disk
cache, right? The same is true for O_SYNC, but I ignore it for this
write-up: with Jan's patch series O_SYNC writes will be implemented
by a range-fdatasync after the actual write, so once that is in, this
discussion covers it, too.

It appears the following Linux filesystems implement barrier support:

- btrfs
- ext3
- ext4
- gfs2
- nilfs2
- ocfs2
- reiserfs
- xfs

Interestingly, of those only ext4, reiserfs and xfs contain direct
calls to blkdev_issue_flush. And unless a filesystem really creates
a transaction for every write and forces that out on fdatasync, it seems
like all the others have no chance to guarantee a cache flush on
fdatasync.

I have tested btrfs, ext3, ext4, reiserfs, and xfs with a simple test
program that just does a buffered write into a file, and then calls
fdatasync. All of the above filesystems issue a barrier request
when the file blocks aren't allocated yet (for ext3 and reiserfs
only when barriers are explicitly enabled, of course).
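
For reference, a test of this kind could look roughly like the sketch
below. This is not the actual program used for the results above; the
function name write_then_fdatasync, the 4096-byte buffer and the fill
byte are made up for illustration. It does a buffered write into a file
and then calls fdatasync.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/*
 * Hypothetical helper: buffered write of `len` bytes starting at
 * offset 0 of `path`, followed by fdatasync().
 * Returns 0 on success, -1 on any error.
 */
int write_then_fdatasync(const char *path, size_t len)
{
    char buf[4096];
    size_t done = 0;
    int fd;

    assert(len > 0);
    memset(buf, 0xab, sizeof(buf));

    /* No O_TRUNC: an existing file keeps its already allocated blocks. */
    fd = open(path, O_WRONLY | O_CREAT, 0644);
    if (fd < 0)
        return -1;

    while (done < len) {
        size_t chunk = len - done < sizeof(buf) ? len - done : sizeof(buf);
        ssize_t n = write(fd, buf, chunk);

        if (n <= 0) {           /* treat a zero-length write as failure */
            close(fd);
            return -1;
        }
        done += (size_t)n;
    }

    /* The call under test: data must be stable when this returns. */
    if (fdatasync(fd) < 0) {
        close(fd);
        return -1;
    }
    return close(fd);
}
```

Calling write_then_fdatasync("testfile", 8192) on a fresh file exercises
the "blocks not allocated yet" case. Whether a barrier or cache flush
actually reaches the device cannot be seen from the return value; that
has to be observed below the filesystem, for example with a block layer
tracer such as blktrace (an assumption here, the tool used for the
results above is not stated).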

That's not the case anymore when all blocks are already allocated.
As expected from the grep for blkdev_issue_flush above, reiserfs and
xfs still issue a barrier in that case. Btrfs also performs a cache
flush in every case, which at first seems unexpected due to the lack of
any blkdev_issue_flush call, but given that btrfs is a COW filesystem
it actually has to allocate blocks even for an overwrite.
Ext3 expectedly does not issue a cache flush in that case, but ext4
unexpectedly does not issue one either. The reason is that ext4 only
issues the cache flush if the inode was dirty, and not at all
otherwise.
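
The already-allocated case can be reproduced along the lines of the
sketch below. Again this is only an illustration under assumptions:
the names fill_range and overwrite_then_fdatasync are hypothetical, and
the first pass uses fsync to get the freshly allocated blocks and the
inode out before the interesting part starts. The second pass then
overwrites the same range in place, so the file size and block mapping
do not change (except on a COW filesystem such as btrfs, which
allocates new blocks anyway).

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical helper: write `len` bytes of `fill` at offset 0. */
static int fill_range(int fd, unsigned char fill, size_t len)
{
    char buf[4096];
    size_t done = 0;

    memset(buf, fill, sizeof(buf));
    while (done < len) {
        size_t chunk = len - done < sizeof(buf) ? len - done : sizeof(buf);
        ssize_t n = pwrite(fd, buf, chunk, (off_t)done);

        if (n <= 0)
            return -1;
        done += (size_t)n;
    }
    return 0;
}

/*
 * Hypothetical test: allocate and sync the file first, then overwrite
 * the same range and call fdatasync(). Returns 0 on success, -1 on error.
 */
int overwrite_then_fdatasync(const char *path, size_t len)
{
    int ret = -1;
    int fd;

    assert(len > 0);
    fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;

    /* Pass 1: allocate the blocks and push data plus metadata out. */
    if (fill_range(fd, 0x00, len) < 0 || fsync(fd) < 0)
        goto out;

    /* Pass 2: overwrite in place; no new blocks, no size change. */
    if (fill_range(fd, 0xff, len) < 0)
        goto out;

    /* The call under test: does this one still flush the disk cache? */
    if (fdatasync(fd) < 0)
        goto out;

    ret = 0;
out:
    if (close(fd) < 0)
        ret = -1;
    return ret;
}
```

The interesting request is the one triggered by the final fdatasync.
As with any such test, the return value says nothing about the disk
cache, so the flush has to be checked at the block layer.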