Re: REQ_FLUSH, REQ_FUA and open/close of block devices

From: Alex Bligh
Date: Sun May 22 2011 - 07:17:35 EST


Christoph,

--On 22 May 2011 06:44:49 -0400 Christoph Hellwig <hch@xxxxxxxxxxxxx> wrote:

> On Sat, May 21, 2011 at 09:42:45AM +0100, Alex Bligh wrote:
>> What I am concerned about is that relatively normal actions (e.g. unmount
>> a filing system) do not appear to be flushing all data, even though I
>> did "sync" then "umount". I suspect the sync is generating the FLUSH
>> here, and nothing is flushing the umount writes. How can I know as a
>> block device that I have to write out a (long lasting) writeback cache if
>> I don't receive anything beyond the last WRITE?

> In your case it seems like ext3 is doing something wrong. If you
> run the same on XFS, you should not only see the last real write
> having FUA and FLUSH as it's a transaction commit, but also an
> explicit cache flush when devices are closed from the filesystem
> to work around issues like that.

OK. Sounds like an ext3 bug then. I will test with xfs, ext4 and btrfs
and see if they exhibit the same symptoms, and come back with a more
appropriate subject line.

> But the raw block device node
> really doesn't behave differently from a file and shouldn't cause
> any fsync on close.

Fair enough. I will check whether the hypervisor concerned is doing
an fsync() or equivalent in the right place.
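
By "the right place" I mean, roughly, issuing an fsync()/fdatasync() on the
backing fd whenever the guest asks for a cache flush (and again on close).
Something like the following illustrative sketch (not the hypervisor's
actual code, and the function name is my own invention):

/* Purely illustrative: on a guest cache-flush request, flush the backing
 * file (or raw block device node) so the host page cache and the disk's
 * volatile write cache are actually emptied. */
#include <unistd.h>

static int handle_guest_flush(int backing_fd)
{
        return fdatasync(backing_fd);
}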

> Btw, using sync_file_range is a really bad idea. It will not actually
> flush the disk cache on the server, nor make sure metadata is committed in
> case of a sparse or preallocated file, and thus does not implement
> the FLUSH or FUA semantics correctly.
>
> And btw, I'd like to know what makes sync_file_range so tempting,
> even after I added documentation to the man page explaining why it's
> almost always wrong to use it.

I think you are referring to this (which in my defence wasn't in my
local copy of the manpage).

This system call is extremely dangerous and should not be used in
portable programs. None of these operations writes out the file's
metadata. Therefore, unless the application is strictly performing
overwrites of already-instantiated disk blocks, there are no
guarantees that the data will be available after a crash. There is no
user interface to know if a write is purely an overwrite. On file
systems using copy-on-write semantics (e.g., btrfs) an overwrite of
existing allocated blocks is impossible. When writing into preallocated
space, many file systems also require calls into the block allocator,
which this system call does not sync out to disk. This system call
does not flush disk write caches and thus does not provide any data
integrity on systems with volatile disk write caches.

So, the file in question is not mmap'd (it's an nbd disk). fsync() /
fdatasync() is too expensive, as it will sync everything. As far as I can
tell, this is no more dangerous with regard to metadata than fdatasync(),
which also does not sync metadata. I had read the last sentence as "this
system call does not *necessarily* flush disk write caches" (meaning "if
you haven't mounted e.g. ext3 with barriers=1, then you can't ensure write
caches are written through"), as opposed to "will never flush disk write
caches". And given that mounting ext3 without barriers=1 produces no FUA
or FLUSH commands in normal operation anyway (as far as light debugging
can see), that's not much of a loss.
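
For concreteness, the kind of call I mean is roughly the following
(illustrative only; the flag combination is the obvious "write and wait"
one, not necessarily exactly what the server does today):

/* Illustrative only: push out just the byte range covered by a completed
 * write with sync_file_range(). As the man page text quoted above says,
 * this writes the data pages but syncs no metadata and does not flush
 * the disk's volatile write cache. */
#define _GNU_SOURCE
#include <fcntl.h>

static int flush_write_range(int fd, off64_t offset, off64_t len)
{
        return sync_file_range(fd, offset, len,
                               SYNC_FILE_RANGE_WAIT_BEFORE |
                               SYNC_FILE_RANGE_WRITE |
                               SYNC_FILE_RANGE_WAIT_AFTER);
}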

But rather than trying to justify myself: what is the best way to
emulate FUA, i.e. to ensure a specific portion of a file is synced before
returning, without syncing the whole lot (which is far too slow)? The
only other option I can see is to open the file with a second fd, mmap
the relevant chunk of the file (the whole file may be larger than the
available virtual address space), msync it with MS_SYNC, then fsync, then
munmap and close, and hope the fsync doesn't spit anything else out. This
seems a little excessive, and I don't even know whether it would work.
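
Concretely, the sequence I have in mind looks something like this (an
untested sketch with names of my own invention; it sidesteps the
address-space problem by assuming the chunk fits in a single mapping, and
rounds the offset down because mmap() wants it page-aligned):

/* Untested sketch of the mmap/msync emulation described above: map just
 * the range of interest via a second fd, push it out with MS_SYNC, then
 * fsync() in the hope that this also gets the disk cache flush. */
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

static int sync_range_via_mmap(const char *path, off_t offset, size_t len)
{
        long page = sysconf(_SC_PAGESIZE);
        off_t aligned = offset & ~((off_t)page - 1);
        size_t maplen = len + (size_t)(offset - aligned);
        int fd, ret = -1;
        void *p;

        fd = open(path, O_RDWR);
        if (fd < 0)
                return -1;

        p = mmap(NULL, maplen, PROT_READ, MAP_SHARED, fd, aligned);
        if (p != MAP_FAILED) {
                ret = msync(p, maplen, MS_SYNC);  /* write back the dirty range */
                if (ret == 0)
                        ret = fsync(fd);          /* and (hopefully) flush the cache */
                munmap(p, maplen);
        }

        close(fd);
        return ret;
}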

I guess given NBD currently does nothing at all to support barriers,
I thought this was an improvement!

--
Alex Bligh