Re: [PATCH v2 2/9] ext2: tell DAX the size of allocation holes

From: Ross Zwisler
Date: Fri Aug 26 2016 - 17:30:02 EST


On Thu, Aug 25, 2016 at 12:57:28AM -0700, Christoph Hellwig wrote:
> Hi Ross,
>
> can you take a look at my (fully working, but not fully cleaned up) version
> of the iomap based DAX code here:
>
> http://git.infradead.org/users/hch/vfs.git/shortlog/refs/heads/iomap-dax
>
> By using iomap we don't even have the size hole problem and totally
> get out of the reverse-engineer what buffer_heads are trying to tell
> us business. It also gets rid of the other warts of the DAX path
> due to pretending to be like direct I/O, so this might be a better
> way forward also for ext2/4.

In general I agree that the usage of struct iomap seems more straightforward
than the old way of using struct buffer_head + get_block_t. I really don't
think we want to have two competing DAX I/O and fault paths, though, which I
assume everyone else agrees with as well.

These changes don't remove the things in XFS needed by the old I/O and fault
paths (e.g. xfs_get_blocks_direct() is still there and unchanged). Is the
correct way forward to get buy-in from ext2/ext4 so that they also move to
supporting an iomap based I/O path (xfs_file_iomap_begin(),
xfs_iomap_write_direct(), etc.)? That would allow us to have parallel I/O and
fault paths for a while, then remove the old buffer_head based versions when
the three supported filesystems have moved to iomap.

If ext2 and ext4 don't choose to move to iomap, though, I don't think we want
to have a separate I/O & fault path for iomap/XFS. That seems too painful,
and the old buffer_head version should continue to work, ugly as it may be.

Assuming we can get buy-in from ext4/ext2, I can work on a PMD version of the
iomap based fault path that is equivalent to the buffer_head based one I sent
out in my series, and we can all eventually move to that.

A few comments/questions on the implementation:

1) In your mail above you say "It also gets rid of the other warts of the DAX
path due to pretending to be like direct I/O". I assume by this you mean
the code in dax_do_io() around DIO_LOCKING, inode_dio_begin(), etc?
Perhaps there are other things as well in XFS, but this is what I see in
the DAX code. If so, yep, this seems like a win. I don't understand how
DIO_LOCKING is relevant to the DAX I/O path, as we never mix buffered and
direct access.

The comment in dax_do_io() for the inode_dio_begin() call says that it
prevents the I/O from racing with truncate. Am I correct that we now get
this protection via the xfs_rw_ilock()/xfs_rw_iunlock() calls in
xfs_file_dax_write()?

2) Just a nit, I noticed that you used "~(PAGE_SIZE - 1)" in several places in
iomap_dax_actor() and iomap_dax_fault() instead of PAGE_MASK. Was this
intentional?
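
To illustrate the nit in 2), here is a minimal userspace sketch, assuming
4 KiB pages. The macros below are stand-ins that mirror how the kernel
usually defines PAGE_SIZE and PAGE_MASK; the two helper names are made up
for this example. Both helpers should round an offset down to the same page
boundary:

```c
/* Userspace stand-ins for the kernel macros, assuming 4 KiB pages. */
#define PAGE_SHIFT 12
#define PAGE_SIZE  (1UL << PAGE_SHIFT)
#define PAGE_MASK  (~(PAGE_SIZE - 1))

/* Hypothetical helper: round down using the open-coded mask. */
static inline unsigned long page_start_open_coded(unsigned long pos)
{
	return pos & ~(PAGE_SIZE - 1);
}

/* Hypothetical helper: the same operation spelled with PAGE_MASK. */
static inline unsigned long page_start_masked(unsigned long pos)
{
	return pos & PAGE_MASK;
}
```

Since the two forms expand to the same expression, switching to PAGE_MASK
would be a readability change only.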

3) It's kind of weird having iomap_dax_fault() in fs/dax.c but having
iomap_dax_actor() and iomap_dax_rw() in fs/iomap.c. I'm guessing the
latter is placed where it is because it uses iomap_apply(), which is local
to fs/iomap.c? Anyway, it would be nice if we could keep them together, if
possible.

4) In iomap_dax_actor() you do this check:

WARN_ON_ONCE(iomap->type != IOMAP_MAPPED);

If we hit this we should bail with -EIO, yea? Otherwise we could write to
unmapped space or something horrible.
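
A rough sketch of what I mean in 4), written as a standalone userspace
function. The types and names here (sketch_iomap, sketch_dax_actor, and so
on) are simplified stand-ins I made up for illustration, not the kernel
definitions, and the real actor obviously does much more than this:

```c
#include <errno.h>

/* Hypothetical, simplified stand-ins; not the kernel definitions. */
enum sketch_iomap_type { SKETCH_IOMAP_HOLE, SKETCH_IOMAP_MAPPED };

struct sketch_iomap {
	enum sketch_iomap_type type;
};

/*
 * Returns the number of bytes that may be processed, or -EIO if the
 * extent is not backed by allocated blocks.
 */
static long sketch_dax_actor(const struct sketch_iomap *iomap, long length)
{
	/*
	 * In the kernel this could be written as
	 *	if (WARN_ON_ONCE(iomap->type != IOMAP_MAPPED))
	 *		return -EIO;
	 * since WARN_ON_ONCE() evaluates to the condition it tested.
	 */
	if (iomap->type != SKETCH_IOMAP_MAPPED)
		return -EIO;

	return length;
}
```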

5) In iomap_dax_fault(), I think the "I/O beyond the end of the file" check
might have been broken. Take for example an I/O to the second page of a
file, where the file has size one page. So:

vmf->pgoff = 1
i_size_read(inode) = 4096

Here's the old code in dax_fault():

size = (i_size_read(inode) + PAGE_SIZE - 1) >> PAGE_SHIFT;
if (vmf->pgoff >= size)
	return VM_FAULT_SIGBUS;

size = (4096 + 4096 - 1) >> PAGE_SHIFT = 1
vmf->pgoff is 1 and size is 1, so we return SIGBUS

Here's the new code:

if (pos >= i_size_read(inode) + PAGE_SIZE - 1)
	return VM_FAULT_SIGBUS;

pos = vmf->pgoff << PAGE_SHIFT = 4096
i_size_read(inode) + PAGE_SIZE - 1 = 8191
so, 'pos' isn't >= where we calculate the end of the file to be, so we do I/O

Basically the old check did the "+ PAGE_SIZE - 1" so that the >> PAGE_SHIFT
was sure to round up to the next full page. You don't need this with your
current logic, so I think the test should just be:

if (pos >= i_size_read(inode))
	return VM_FAULT_SIGBUS;

Right?
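
To make the comparison in 5) easy to check outside the kernel, here is a
small userspace sketch of the three variants, assuming 4 KiB pages. The
function names are made up for this example, and each one returns true
where the corresponding kernel code would return VM_FAULT_SIGBUS:

```c
#include <stdbool.h>

/* Userspace stand-ins for the kernel macros, assuming 4 KiB pages. */
#define PAGE_SHIFT 12
#define PAGE_SIZE  (1UL << PAGE_SHIFT)

/* Old dax_fault() check: compare page indexes, rounding i_size up. */
static bool old_check_sigbus(unsigned long pgoff, unsigned long long i_size)
{
	unsigned long long size = (i_size + PAGE_SIZE - 1) >> PAGE_SHIFT;

	return pgoff >= size;
}

/* Check as quoted from the iomap branch: byte offset vs. padded size. */
static bool new_check_sigbus(unsigned long pgoff, unsigned long long i_size)
{
	unsigned long long pos = (unsigned long long)pgoff << PAGE_SHIFT;

	return pos >= i_size + PAGE_SIZE - 1;
}

/* Suggested check: byte offset vs. the unpadded file size. */
static bool proposed_check_sigbus(unsigned long pgoff, unsigned long long i_size)
{
	unsigned long long pos = (unsigned long long)pgoff << PAGE_SHIFT;

	return pos >= i_size;
}
```

For pgoff = 1 and a 4096 byte file, the old and suggested checks both
report SIGBUS, while the check from the iomap branch lets the fault proceed.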

6) Regarding the "we don't even have the size hole problem" comment in your
mail, the current PMD logic requires us to know the size of the hole. This
is important so that we can fault in a huge zero page if we have a 2 MiB
hole. It's fine if that 2 MiB page then gets fragmented into 4k DAX
allocations when we start to do writes, but the path the other way doesn't
work. If we don't know the size of holes then we can't fault in a 2 MiB
zero page, so we'll use 4k zero pages to satisfy reads. This means that if
later we want to fault in a 2 MiB DAX allocation, we don't have a single
entry that we can use to lock the entire 2 MiB range while we clean the
radix tree and unmap the range from all the user processes. With the
current PMD logic this will mean that if someone does a 4k read that faults
in a 4k zero page, we will only use 4k faults for that range and won't use
PMDs.

The current XFS code in the v4.8 tree tells me the size of the hole, and I
think we need to keep this functionality.
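
As a rough illustration of why the hole size matters in 6), here is a
userspace sketch of the kind of test the PMD path needs to be able to make.
It assumes a 2 MiB PMD, and the names (SKETCH_PMD_SIZE, hole_covers_pmd) are
hypothetical rather than taken from the DAX code. Given the hole reported by
the filesystem, it asks whether the whole 2 MiB-aligned range around the
faulting offset is a hole, which is what would allow a huge zero page:

```c
#include <stdbool.h>

/* Assumption for this sketch: a PMD maps 2 MiB. */
#define SKETCH_PMD_SIZE (2ULL << 20)

/*
 * Returns true if the 2 MiB-aligned range containing 'pos' lies entirely
 * inside the hole [hole_start, hole_start + hole_len).
 */
static bool hole_covers_pmd(unsigned long long pos,
			    unsigned long long hole_start,
			    unsigned long long hole_len)
{
	unsigned long long pmd_start = pos & ~(SKETCH_PMD_SIZE - 1);
	unsigned long long pmd_end = pmd_start + SKETCH_PMD_SIZE;

	return hole_start <= pmd_start && pmd_end <= hole_start + hole_len;
}
```

If the filesystem only ever reports a hole one block at a time, a test like
this can never succeed, and we end up with 4k zero pages for that range.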