Re: [PATCH 7/7] xfs: re-enable XFS per-inode DAX

From: Dave Chinner
Date: Mon Sep 25 2017 - 20:31:23 EST


On Mon, Sep 25, 2017 at 05:14:04PM -0600, Ross Zwisler wrote:
> Re-enable the XFS per-inode DAX flag, preventing S_DAX from changing when
> any mappings are present.
>
> Signed-off-by: Ross Zwisler <ross.zwisler@xxxxxxxxxxxxxxx>
> ---
> fs/xfs/xfs_ioctl.c | 17 +++++++++++++++--
> 1 file changed, 15 insertions(+), 2 deletions(-)
>
> diff --git a/fs/xfs/xfs_ioctl.c b/fs/xfs/xfs_ioctl.c
> index 386b437..7a24dd5 100644
> --- a/fs/xfs/xfs_ioctl.c
> +++ b/fs/xfs/xfs_ioctl.c
> @@ -1012,12 +1012,10 @@ xfs_diflags_to_linux(
> inode->i_flags |= S_NOATIME;
> else
> inode->i_flags &= ~S_NOATIME;
> -#if 0 /* disabled until the flag switching races are sorted out */
> if ((xflags & FS_XFLAG_DAX) || (ip->i_mount->m_flags & XFS_MOUNT_DAX))
> inode->i_flags |= S_DAX;
> else
> inode->i_flags &= ~S_DAX;
> -#endif
> }
>
> static bool
> @@ -1049,6 +1047,8 @@ xfs_ioctl_setattr_xflags(
> {
> struct xfs_mount *mp = ip->i_mount;
> uint64_t di_flags2;
> + struct address_space *mapping = VFS_I(ip)->i_mapping;
> + bool dax_changing;
>
> /* Can't change realtime flag if any extents are allocated. */
> if ((ip->i_d.di_nextents || ip->i_delayed_blks) &&
> @@ -1084,10 +1084,23 @@ xfs_ioctl_setattr_xflags(
> if (di_flags2 && ip->i_d.di_version < 3)
> return -EINVAL;
>
> + dax_changing = xfs_is_dax_state_changing(fa->fsx_xflags, ip);
> + if (dax_changing) {
> + i_mmap_lock_read(mapping);
> + if (mapping_mapped(mapping)) {
> + i_mmap_unlock_read(mapping);
> + return -EBUSY;
> + }
> + }
> +
> ip->i_d.di_flags = xfs_flags2diflags(ip, fa->fsx_xflags);
> ip->i_d.di_flags2 = di_flags2;
>
> xfs_diflags_to_linux(ip);
> +
> + if (dax_changing)
> + i_mmap_unlock_read(mapping);

Is this safe to be taking here under the ILOCK_EXCL? i.e. this is
the lock order here:

IOLOCK_EXCL -> MMAPLOCK_EXCL -> ILOCK_EXCL -> i_mmap_rwsem

The truncate path must run outside the ILOCK
context, and it does this order via unmap_mapping_range:

IOLOCK_EXCL -> MMAPLOCK_EXCL -> i_mmap_rwsem

On a page fault, we do:

mmap_sem -> MMAPLOCK_EXCL -> page lock -> ILOCK_EXCL

Which gives the order

IOLOCK_EXCL
-> mmap_sem
-> MMAPLOCK_EXCL
-> page lock
-> ILOCK_EXCL
-> i_mmap_rwsem

What I'm not clear on is what the orders between page locks and
pte locks and i_mapping_tree_lock and i_mmap_rwsem. If there's any
locks that the filesystem can take above the ILOCK that are also
taken under the i_mmap_rwsem, then we have a deadlock vector.

Historically we've avoided any mm/ level interactions under the
ILOCK_EXCL because of it's location in the page fault path locking
order (e.g. lockdep will go nuts if we take a page fault with the
ILOCK held). Hence I'm extremely wary of putting any other mm/ level
locks under the ILOCK like this without a clear explanation of the
locking orders and why it won't deadlock....

Cheers,

Dave.

--
Dave Chinner
david@xxxxxxxxxxxxx