Re: [PATCH 06/13] xfs: remove XFS_TRANS_NOFS

From: Darrick J. Wong
Date: Fri Jun 28 2019 - 13:42:02 EST


On Fri, Jun 28, 2019 at 07:37:17AM +0200, Christoph Hellwig wrote:
> On Thu, Jun 27, 2019 at 03:30:30PM -0700, Darrick J. Wong wrote:
> > I think the wording of this is too indirect. The reason we need to set
> > NOFS is because we could be doing writeback as part of reclaiming
> > memory, which means that we cannot recurse back into filesystems to
> > satisfy the memory allocation needed to create a transaction. The NOFS
> > part applies to any memory allocation, of course.
> >
> > If you're fine with the wording below I'll just edit that into the
> > patch:
> >
> > /*
> > * We can allocate memory here while doing writeback on behalf of
> > * memory reclaim. To avoid memory allocation deadlocks set the
> > * task-wide nofs context for the following operations.
> > */
> > nofs_flag = memalloc_nofs_save();
>
> Fine with me.
>
> > > trace_xfs_end_io_direct_write(ip, offset, size);
> > > @@ -395,10 +396,11 @@ xfs_dio_write_end_io(
> > > */
> > > XFS_STATS_ADD(ip->i_mount, xs_write_bytes, size);
> > >
> > > + nofs_flag = memalloc_nofs_save();
> >
> > Hmm, do we need this here? I can't remember if there was a need for
> > setting NOFS under xfs_reflink_end_cow from a dio completion or if that
> > was purely the buffered writeback case...
>
> We certainly had to add it for the unwritten extent conversion, maybe
> the corner case just didn't manage to show up for COW yet:

AFAICT the referenced patch solved the problem of writeback ioend
completion deadlocking with memory reclaim by changing the transaction
allocation call in the xfs_iomap_write_unwritten function, which is
called by writeback ioend completion.

However, the directio endio code also calls xfs_iomap_write_unwritten.
I can't tell if NOFS is actually needed in that context, or if we've
just been operating like that for a decade because that's the method
that was chosen to solve the deadlock.

I think this boils down to -- does writeback induced by memory reclaim
ever block on directio?

I don't think it does, but as a counterpoint xfs has been like this for
10 years and there don't seem to be many complaints about directio endio
pushing memory reclaim too hard...

--D

>
> commit 80641dc66a2d6dfb22af4413227a92b8ab84c7bb
> Author: Christoph Hellwig <hch@xxxxxxxxxxxxx>
> Date: Mon Oct 19 04:00:03 2009 +0000
>
> xfs: I/O completion handlers must use NOFS allocations
>
> When completing I/O requests we must not allow the memory allocator to
> recurse into the filesystem, as we might deadlock on waiting for the
> I/O completion otherwise. The only thing currently allocating normal
> GFP_KERNEL memory is the allocation of the transaction structure for
> the unwritten extent conversion. Add a memflags argument to
> _xfs_trans_alloc to allow controlling the allocator behaviour.
>
> Signed-off-by: Christoph Hellwig <hch@xxxxxx>
> Reported-by: Thomas Neumann <tneumann@xxxxxxxxxxxxxxxxxxxxx>
> Tested-by: Thomas Neumann <tneumann@xxxxxxxxxxxxxxxxxxxxx>
> Reviewed-by: Alex Elder <aelder@xxxxxxx>
> Signed-off-by: Alex Elder <aelder@xxxxxxx>
>
> diff --git a/fs/xfs/xfs_fsops.c b/fs/xfs/xfs_fsops.c
> index 2d0b3e1da9e6..6f83f58c099f 100644
> --- a/fs/xfs/xfs_fsops.c
> +++ b/fs/xfs/xfs_fsops.c
> @@ -611,7 +611,7 @@ xfs_fs_log_dummy(
> xfs_inode_t *ip;
> int error;
>
> - tp = _xfs_trans_alloc(mp, XFS_TRANS_DUMMY1);
> + tp = _xfs_trans_alloc(mp, XFS_TRANS_DUMMY1, KM_SLEEP);
> error = xfs_trans_reserve(tp, 0, XFS_ICHANGE_LOG_RES(mp), 0, 0, 0);
> if (error) {
> xfs_trans_cancel(tp, 0);
> diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
> index 67ae5555a30a..7294abce6ef2 100644
> --- a/fs/xfs/xfs_iomap.c
> +++ b/fs/xfs/xfs_iomap.c
> @@ -860,8 +860,15 @@ xfs_iomap_write_unwritten(
> * set up a transaction to convert the range of extents
> * from unwritten to real. Do allocations in a loop until
> * we have covered the range passed in.
> + *
> + * Note that we open code the transaction allocation here
> + * to pass KM_NOFS--we can't risk to recursing back into
> + * the filesystem here as we might be asked to write out
> + * the same inode that we complete here and might deadlock
> + * on the iolock.
> */
> - tp = xfs_trans_alloc(mp, XFS_TRANS_STRAT_WRITE);
> + xfs_wait_for_freeze(mp, SB_FREEZE_TRANS);
> + tp = _xfs_trans_alloc(mp, XFS_TRANS_STRAT_WRITE, KM_NOFS);
> tp->t_flags |= XFS_TRANS_RESERVE;
> error = xfs_trans_reserve(tp, resblks,
> XFS_WRITE_LOG_RES(mp), 0,
> diff --git a/fs/xfs/xfs_mount.c b/fs/xfs/xfs_mount.c
> index 8b6c9e807efb..4d509f742bd2 100644
> --- a/fs/xfs/xfs_mount.c
> +++ b/fs/xfs/xfs_mount.c
> @@ -1471,7 +1471,7 @@ xfs_log_sbcount(
> if (!xfs_sb_version_haslazysbcount(&mp->m_sb))
> return 0;
>
> - tp = _xfs_trans_alloc(mp, XFS_TRANS_SB_COUNT);
> + tp = _xfs_trans_alloc(mp, XFS_TRANS_SB_COUNT, KM_SLEEP);
> error = xfs_trans_reserve(tp, 0, mp->m_sb.sb_sectsize + 128, 0, 0,
> XFS_DEFAULT_LOG_COUNT);
> if (error) {
> diff --git a/fs/xfs/xfs_trans.c b/fs/xfs/xfs_trans.c
> index 66b849358e62..237badcbac3b 100644
> --- a/fs/xfs/xfs_trans.c
> +++ b/fs/xfs/xfs_trans.c
> @@ -236,19 +236,20 @@ xfs_trans_alloc(
> uint type)
> {
> xfs_wait_for_freeze(mp, SB_FREEZE_TRANS);
> - return _xfs_trans_alloc(mp, type);
> + return _xfs_trans_alloc(mp, type, KM_SLEEP);
> }
>
> xfs_trans_t *
> _xfs_trans_alloc(
> xfs_mount_t *mp,
> - uint type)
> + uint type,
> + uint memflags)
> {
> xfs_trans_t *tp;
>
> atomic_inc(&mp->m_active_trans);
>
> - tp = kmem_zone_zalloc(xfs_trans_zone, KM_SLEEP);
> + tp = kmem_zone_zalloc(xfs_trans_zone, memflags);
> tp->t_magic = XFS_TRANS_MAGIC;
> tp->t_type = type;
> tp->t_mountp = mp;
> diff --git a/fs/xfs/xfs_trans.h b/fs/xfs/xfs_trans.h
> index ed47fc77759c..a0574f593f52 100644
> --- a/fs/xfs/xfs_trans.h
> +++ b/fs/xfs/xfs_trans.h
> @@ -924,7 +924,7 @@ typedef struct xfs_trans {
> * XFS transaction mechanism exported interfaces.
> */
> xfs_trans_t *xfs_trans_alloc(struct xfs_mount *, uint);
> -xfs_trans_t *_xfs_trans_alloc(struct xfs_mount *, uint);
> +xfs_trans_t *_xfs_trans_alloc(struct xfs_mount *, uint, uint);
> xfs_trans_t *xfs_trans_dup(xfs_trans_t *);
> int xfs_trans_reserve(xfs_trans_t *, uint, uint, uint,
> uint, uint);