Re: [PATCH] vfs: check dentry is still valid in get_link()

From: Dave Chinner
Date: Mon Jan 17 2022 - 23:13:03 EST


On Tue, Jan 18, 2022 at 03:17:13AM +0000, Al Viro wrote:
> On Tue, Jan 18, 2022 at 02:00:41PM +1100, Dave Chinner wrote:
> > > IOW, how far is xfs_inode_mark_reclaimable() from being callable in RCU
> > > callback context?
> >
> > AIUI, not very close at all,
> >
> > I'm pretty sure we can't put it under RCU callback context at all
> > because xfs_fs_destroy_inode() can take sleeping locks, perform
> > transactions, do IO, run rcu_read_lock() critical sections, etc.
> > This means that needs to run an a full task context and so can't run
> > from RCU callback context at all.
>
> Umm... AFAICS, this
> pag = xfs_perag_get(mp, XFS_INO_TO_AGNO(mp, ip->i_ino));
> spin_lock(&pag->pag_ici_lock);
> spin_lock(&ip->i_flags_lock);
>
> trace_xfs_inode_set_reclaimable(ip);
> ip->i_flags &= ~(XFS_NEED_INACTIVE | XFS_INACTIVATING);
> ip->i_flags |= XFS_IRECLAIMABLE;
> xfs_perag_set_inode_tag(pag, XFS_INO_TO_AGINO(mp, ip->i_ino),
> XFS_ICI_RECLAIM_TAG);
>
> spin_unlock(&ip->i_flags_lock);
> spin_unlock(&pag->pag_ici_lock);
> xfs_perag_put(pag);
>
> in the end of xfs_inodegc_set_reclaimable() could go into ->free_inode()
> just fine.

No, that just creates a black hole where the VFS inode has been
destroyed but the XFS inode cache doesn't know it's been trashed.
Hence setting XFS_IRECLAIMABLE needs to remain in the during
->destroy_inode, otherwise the ->lookup side of the cache will think
that are currently still in use by the VFS and hand them straight
back out without going through the inode recycling code.

i.e. XFS_IRECLAIMABLE is the flag that tells xfs_iget() that the VFS
part of the inode has been torn down, and that it must go back
through VFS re-initialisation before it can be re-instantiated as a
VFS inode.

It would also mean that the inode will need to go through two RCU
grace periods before it gets reclaimed, because XFS uses RCU
protected inode cache lookups internally (e.g. for clustering dirty
inode writeback) and so freeing the inode from the internal
XFS inode cache requires RCU freeing...

> It's xfs_inodegc_queue() I'm not sure about - the part
> about flush_work() in there...

That's the bit that prevents unbound queuing of inodes that require
inactivation because of the problems that causes memory reclaim and
performance. It blocks waiting for xfs_inactive()
calls to complete via xfs_inodegc_worker(), and it's the
xfs_inactive() calls that do all the IO, transactions, locking, etc.

So if we block waiting for them to complete in RCU callback context,
we're effectively inverting the current locking order for entire
filesystem w.r.t. RCU...

Also worth considering: if we are processing the unlink of an inode
with tens or hundreds of millions of extents in xfs_inactive(), that
flush_work() call could block for *minutes* waiting on inactivation
to run the millions of transactions needed to free that inode.
Regardless of lock ordering, we don't want to block RCU grace period
callback work for that long....

> I'm not familiar with that code; could you point me towards some
> docs/old postings/braindump/whatnot?

Commit ab23a7768739 ("xfs: per-cpu deferred inode inactivation
queues") explains the per-cpu queuing mechanism that is being used
in this code.

Cheers,

Dave.
--
Dave Chinner
david@xxxxxxxxxxxxx