Re: [PATCH 2/2] dax: fix bdev NULL pointer dereferences

From: Ross Zwisler
Date: Fri Feb 05 2016 - 17:25:17 EST


On Thu, Feb 04, 2016 at 09:29:57PM +0100, Jan Kara wrote:
> I think changes aren't very intrusive so we can feed them in during RC
> phase and frankly, you have to move to using ->writepages() anyway to make
> sync(2) work reliably.

I've been looking into this a bit more, and I don't think we actually want to
have DAX flushing in response to sync(2) or syncfs(2). Here's some text from
the BUGS section of the sync(2) man page:

BUGS
According to the standard specification (e.g., POSIX.1-2001), sync()
schedules the writes, but may return before the actual writing
is done. However, since version 1.3.20 Linux does actually wait.
(This still does not guarantee data integrity: modern disks have large
caches.)

Based on this I don't believe that it is a requirement that sync and syncfs
actually flush the data durably to media before they return - they just need
to make sure it has been sent to the device, which is always true for all
writes PMEM via DAX, even without any calls to dax_writeback_mapping_range().

The fsync(2) man page, on the other hand, *does* require that a call to
fsync() will result in the data being durable on the media before the system
call returns. From fsync(2):

fsync() transfers ("flushes") all modified in-core data of (i.e.,
modified buffer cache pages for) the file referred to by the file
descriptor fd to the disk device (or other permanent storage device)
so that all changed information can be retrieved even after the
system crashed or was rebooted. This includes writing through or
flushing a disk cache if present.

I think we are doing the right thing if we only call
dax_writeback_mapping_range() in response to fsync() and msync(), but skip it
in response to sync() and syncfs(). Practically this also simplifies things a
lot because we don't need to worry about cleaning the DAX inodes. With my
naive implementation where I was calling dax_writeback_mapping_range() in
ext4_writepages(), we were flushing all DAX pages that had ever been dirtied
every 5 seconds in response to a periodic filesystem sync(), which I'm sure is
untenable.

Please let me know if anyone disagrees with any of the above. Based on this,
I'm off to move the calls to dax_writeback_mapping_range() back into the
filesystem specific fsync()/msync() paths so that we can provide an
appropriate block device.