Re: [PATCH 28/28] xfs: rework unreferenced inode lookups

From: Brian Foster
Date: Fri Nov 15 2019 - 12:26:10 EST


On Fri, Nov 15, 2019 at 09:16:02AM +1100, Dave Chinner wrote:
> On Wed, Nov 06, 2019 at 05:18:46PM -0500, Brian Foster wrote:
> > On Fri, Nov 01, 2019 at 10:46:18AM +1100, Dave Chinner wrote:
> > > From: Dave Chinner <dchinner@xxxxxxxxxx>
> > >
> > > Looking up an unreferenced inode in the inode cache is a bit hairy.
> > > We do this for inode invalidation and writeback clustering purposes,
> > > which is all invisible to the VFS. Hence we can't take reference
> > > counts to the inode and so must be very careful how we do it.
> > >
> > > There are several different places that all do the lookups and
> > > checks slightly differently. Fundamentally, though, they are all
> > > racy and inode reclaim has to block waiting for the inode lock if it
> > > loses the race. This is not very optimal given all the work we;ve
> > > already done to make reclaim non-blocking.
> > >
> > > We can make the reclaim process nonblocking with a couple of simple
> > > changes. If we define the unreferenced lookup process in a way that
> > > will either always grab an inode in a way that reclaim will notice
> > > and skip, or will notice a reclaim has grabbed the inode so it can
> > > skip the inode, then there is no need for reclaim to need to cycle
> > > the inode ILOCK at all.
> > >
> > > Selecting an inode for reclaim is already non-blocking, so if the
> > > ILOCK is held the inode will be skipped. If we ensure that reclaim
> > > holds the ILOCK until the inode is freed, then we can do the same
> > > thing in the unreferenced lookup to avoid inodes in reclaim. We can
> > > do this simply by holding the ILOCK until the RCU grace period
> > > expires and the inode freeing callback is run. As all unreferenced
> > > lookups have to hold the rcu_read_lock(), we are guaranteed that
> > > a reclaimed inode will be noticed as the trylock will fail.
> > >
> > ...
> > >
> > > Signed-off-by: Dave Chinner <dchinner@xxxxxxxxxx>
> > > ---
> > > fs/xfs/mrlock.h | 27 +++++++++
> > > fs/xfs/xfs_icache.c | 88 +++++++++++++++++++++--------
> > > fs/xfs/xfs_inode.c | 131 +++++++++++++++++++++-----------------------
> > > 3 files changed, 153 insertions(+), 93 deletions(-)
> > >
> > > diff --git a/fs/xfs/mrlock.h b/fs/xfs/mrlock.h
> > > index 79155eec341b..1752a2592bcc 100644
> > > --- a/fs/xfs/mrlock.h
> > > +++ b/fs/xfs/mrlock.h
> > ...
> > > diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
> > > index 11bf4768d491..45ee3b5cd873 100644
> > > --- a/fs/xfs/xfs_icache.c
> > > +++ b/fs/xfs/xfs_icache.c
> > > @@ -106,6 +106,7 @@ xfs_inode_free_callback(
> > > ip->i_itemp = NULL;
> > > }
> > >
> > > + mrunlock_excl_non_owner(&ip->i_lock);
> > > kmem_zone_free(xfs_inode_zone, ip);
> > > }
> > >
> > > @@ -132,6 +133,7 @@ xfs_inode_free(
> > > * free state. The ip->i_flags_lock provides the barrier against lookup
> > > * races.
> > > */
> > > + mrupdate_non_owner(&ip->i_lock);
> >
> > Can we tie these into the proper locking interface using flags? For
> > example, something like xfs_ilock(ip, XFS_ILOCK_EXCL|XFS_ILOCK_NONOWNER)
> > or xfs_ilock(ip, XFS_ILOCK_EXCL_NONOWNER) perhaps?
>
> I'd prefer not to make this part of the common locking interface -
> it's a one off special use case, not something we want to progate
> elsewhere into the code.
>

What urks me about this is that it obfuscates rather than highlights
that fact because I have no idea what mrtryupdate_non_owner() actually
does without looking it up. We could easily name a flag
XFS_ILOCK_PENDING_RECLAIM or something similarly ridiculous to make it
blindingly obvious it should only be used in a special context.

> Now that I think over it, I probably should have tagged this with
> patch with [RFC]. I think we should just get rid of the mrlock
> wrappers rather than add more, and that would simplify this a lot.
>

Yeah, FWIW I've been reviewing this patch as a WIP on top of all of the
nonblocking bits as opposed to being some fundamental part of that work.
That aside, I also agree that cleaning up these wrappers might address
that concern because something like:

/* open code ilock because ... */
down_write_trylock_non_owner(&ip->i_lock);

... is at least readable code.

>
> > > spin_lock(&ip->i_flags_lock);
> > > ip->i_flags = XFS_IRECLAIM;
> > > ip->i_ino = 0;
> > > @@ -295,11 +297,24 @@ xfs_iget_cache_hit(
> > > }
> > >
> > > /*
> > > - * We need to set XFS_IRECLAIM to prevent xfs_reclaim_inode
> > > - * from stomping over us while we recycle the inode. Remove it
> > > - * from the LRU straight away so we can re-init the VFS inode.
> > > + * Before we reinitialise the inode, we need to make sure
> > > + * reclaim does not pull it out from underneath us. We already
> > > + * hold the i_flags_lock, and because the XFS_IRECLAIM is not
> > > + * set we know the inode is still on the LRU. However, the LRU
> > > + * code may have just selected this inode to reclaim, so we need
> > > + * to ensure we hold the i_flags_lock long enough for the
> > > + * trylock in xfs_inode_reclaim_isolate() to fail. We do this by
> > > + * removing the inode from the LRU, which will spin on the LRU
> > > + * list locks until reclaim stops walking, at which point we
> > > + * know there is no possible race between reclaim isolation and
> > > + * this lookup.
> > > + *
> >
> > Somewhat related to my question about the lru_lock on the earlier patch.
>
> *nod*
>
> The caveat here is that this is the slow path so spinning for a
> while doesn't really matter.
>
> > > @@ -1022,19 +1076,7 @@ xfs_dispose_inode(
> > > spin_unlock(&pag->pag_ici_lock);
> > > xfs_perag_put(pag);
> > >
> > > - /*
> > > - * Here we do an (almost) spurious inode lock in order to coordinate
> > > - * with inode cache radix tree lookups. This is because the lookup
> > > - * can reference the inodes in the cache without taking references.
> > > - *
> > > - * We make that OK here by ensuring that we wait until the inode is
> > > - * unlocked after the lookup before we go ahead and free it.
> > > - *
> > > - * XXX: need to check this is still true. Not sure it is.
> > > - */
> > > - xfs_ilock(ip, XFS_ILOCK_EXCL);
> > > xfs_qm_dqdetach(ip);
> > > - xfs_iunlock(ip, XFS_ILOCK_EXCL);
> >
> > Ok, so I'm staring at this a bit more and think I'm missing something.
> > If we put aside the change to hold ilock until the inode is freed, we
> > basically have the following (simplified) flow as the inode goes from
> > isolation to disposal:
> >
> > ilock (isolate)
> > iflock
> > set XFS_IRECLAIM
> > ifunlock (disposal)
> > iunlock
> > radix delete
> > ilock cycle (drain)
> > rcu free
> >
> > What we're trying to eliminate is the ilock cycle to drain any
> > concurrent unreferenced lookups from accessing the inode once it is
> > freed. The free itself is still RCU protected.
> >
> > Looking over at the ifree path, we now have something like this:
> >
> > rcu_read_lock()
> > radix lookup
> > check XFS_IRECLAIM
> > ilock
> > if XFS_ISTALE, skip
> > set XFS_ISTALE
> > rcu_read_unlock()
> > iflock
> > /* return locked down inode */
>
> You missed a lock.
>
> rcu_read_lock()
> radix lookup
> >>> i_flags_lock
> check XFS_IRECLAIM
> ilock
> if XFS_ISTALE, skip
> set XFS_ISTALE
> >>> i_flags_unlock
> rcu_read_unlock()
> iflock
>
> > Given that we set XFS_IRECLAIM under ilock, would we still need either
> > the ilock cycle or to hold ilock through the RCU free if the ifree side
> > (re)checked XFS_IRECLAIM after it has the ilock (but before it drops the
> > rcu read lock)?
>
> We set XFS_IRECLAIM under the i_flags_lock.
>
> It is the combination of rcu_read_lock() and i_flags_lock() that
> provides the RCU lookup state barriers - the ILOCK is not part of
> that at all.
>
> The key point here is that once we've validated the inode we found
> in the radix tree under the i_flags_lock, we then take the ILOCK,
> thereby serialising the taking of the ILOCK here with the taking of
> the ILOCK in the reclaim isolation code.
>
> i.e. all the reclaim state serialisation is actually based around
> holding the i_flags_lock, not the ILOCK.
>
> Once we have grabbed the ILOCK under the i_flags_lock, we can
> drop the i_flags_lock knowing that reclaim will not be able isolate
> this inode and set XFS_IRECLAIM.
>

Hmm, Ok. I knew i_flags_lock was in there when I wrote this up. I
intentionally left it out as a simplification because it wasn't clear to
me that it was a critical part of the lookup. I'll keep this in mind the
next time I walk through this code.

> > ISTM we should either have a non-reclaim inode with
> > ilock protection or a reclaim inode with RCU protection (so we can skip
> > it before it frees), but I could easily be missing something here..
>
> Heh. Yeah, it's a complex dance, and it's all based around how
> RCU lookups and the i_flags_lock interact to provide coherent
> detection of freed inodes.
>
> I have a nagging feeling that this whole ILOCK-held-to-rcu-free game
> can be avoided. I need to walk myself through the lookup state
> machine again and determine if ordering the XFS_IRECLAIM flag check
> after greabbing the ILOCK is sufficient to prevent ifree/iflush
> lookups from accessing the inode outside the rcu_read_lock()
> context.
>

That's pretty much what I was wondering...

> If so, most of this patch will go away....
>
> > > + * attached to the buffer so we don't need to do anything more here.
> > > */
> > > - if (ip != free_ip) {
> > > - if (!xfs_ilock_nowait(ip, XFS_ILOCK_EXCL)) {
> > > - rcu_read_unlock();
> > > - delay(1);
> > > - goto retry;
> > > - }
> > > -
> > > - /*
> > > - * Check the inode number again in case we're racing with
> > > - * freeing in xfs_reclaim_inode(). See the comments in that
> > > - * function for more information as to why the initial check is
> > > - * not sufficient.
> > > - */
> > > - if (ip->i_ino != inum) {
> > > + if (__xfs_iflags_test(ip, XFS_ISTALE)) {
> >
> > Is there a correctness reason for why we move the stale check to under
> > ilock (in both iflush/ifree)?
>
> It's under the i_flags_lock, and so I moved it up under the lookup
> hold of the i_flags_lock so we don't need to cycle it again.
>

Yeah, but in both cases it looks like it moved to under the ilock as
well, which comes after i_flags_lock. IOW, why grab ilock for stale
inodes when we're just going to skip them?

Brian

> > > /*
> > > - * We don't need to attach clean inodes or those only with unlogged
> > > - * changes (which we throw away, anyway).
> > > + * We don't need to attach clean inodes to the buffer - they are marked
> > > + * stale in memory now and will need to be re-initialised by inode
> > > + * allocation before they can be reused.
> > > */
> > > if (!ip->i_itemp || xfs_inode_clean(ip)) {
> > > ASSERT(ip != free_ip);
> > > xfs_ifunlock(ip);
> > > - xfs_iunlock(ip, XFS_ILOCK_EXCL);
> > > + if (ip != free_ip)
> > > + xfs_iunlock(ip, XFS_ILOCK_EXCL);
> >
> > There's an assert against this case just above, though I suppose there's
> > nothing wrong with just keeping it and making the functional code more
> > cautious.
>
> *nod*
>
> It follows Darrick's lead of making sure that production kernels
> don't do something stupid because of some whacky corruption we
> didn't expect to ever see.
>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> david@xxxxxxxxxxxxx
>