Re: [PATCH] vfs: check dentry is still valid in get_link()

From: Al Viro
Date: Tue Jan 18 2022 - 14:20:53 EST


On Tue, Jan 18, 2022 at 01:25:15PM -0500, Brian Foster wrote:

> If I go back to the inactive -> reclaimable grace period variant and
> also insert a start_poll_synchronize_rcu() and
> poll_state_synchronize_rcu() pair across the inactive processing
> sequence, I start seeing numbers closer to ~36k cycles. IOW, the
> xfs_inodegc_inactivate() helper looks something like this:
>
> if (poll_state_synchronize_rcu(ip->i_destroy_gp))
> xfs_inodegc_set_reclaimable(ip);
> else
> call_rcu(&VFS_I(ip)->i_rcu, xfs_inodegc_set_reclaimable_callback);
>
> ... to skip the rcu grace period if one had already passed while the
> inode sat on the inactivation queue and was processed.
>
> However my box went haywire shortly after with rcu stall reports or
> something if I let that continue to run longer, so I'll probably have to
> look into that lockdep splat (complaining about &pag->pag_ici_lock in
> rcu context, perhaps needs to become irq safe?) or see if something else
> is busted..

Umm... Could you (or Dave) describe where does the mainline do the
RCU delay mentioned several times in these threads, in case of
* unlink()
* overwriting rename()
* open() + unlink() + close() (that one, obviously, for regular files)

The thing is, if we already do have an RCU delay in there, there might be
a better solution - making sure it happens downstream of d_drop() (in case
when dentry has other references) or dentry going negative (in case we are
holding the sole reference to it).

If we can do that, RCU'd dcache lookups won't run into inode reuse at all.
Sure, right now d_delete() comes last *and* we want the target inode to stay
around past the call of ->unlink(). But if you look at do_unlinkat() you'll
see an interesting-looking wart with ihold/iput around vfs_unlink(). Not
sure if you remember the story on that one; basically, it's "we'd rather
have possible IO on inode freeing to happen outside of the lock on parent".

nfsd and mqueue do the same thing; ksmbd does not. OTOH, ksmbd appears to
force the "make victim go unhashed, sole reference or not". ecryptfs
definitely does that forcing (deliberately so).

That area could use some rethinking, and if we can deal with the inode reuse
delay while we are at it...

Overwriting rename() is also interesting in that respect, of course.

I can go and try to RTFS my way through xfs iget-related code, but I'd
obviously prefer to do so with at least some overview of that thing
from the folks familiar with it. Seeing that it's a lockless search
structure, missing something subtle there is all too easy, especially
with the lookup-by-fhandle stuff in the mix...