Re: [RFC] fsnotify: allow sleepable child dentry flag update

From: Stephen Brennan
Date: Tue Nov 01 2022 - 17:47:22 EST


Al Viro <viro@xxxxxxxxxxxxxxxxxx> writes:
> On Thu, Oct 13, 2022 at 03:27:19PM -0700, Stephen Brennan wrote:
>
>> So I have two more narrowly scoped strategies to improve the situation. Both are
>> included in the hacky, awful patch below.
>>
>> First, is to let __fsnotify_update_child_dentry_flags() sleep. This means nobody
>> is holding the spinlock for several seconds at a time. We can actually achieve
>> this via a cursor, the same way that simple_readdir() is implemented. I think
>> this might require moving the declaration of d_alloc_cursor() and maybe
>> exporting it. I had to #include fs/internal.h which is not ok.
>
> Er... Won't that expose every filesystem to having to deal with cursors?
> Currently it's entirely up to the filesystem in question and I wouldn't
> be a dime on everyone being ready to cope with such surprises...

Hi Al,

I wanted to follow-up on this. Yeah, I didn't realize that this could
cause issues for some filesystems when I wrote it, since I hadn't
considered that rather few filesystems use dcache_readdir(), and so most
aren't prepared to encounter a cursor. Thanks for that catch.

I think I came up with a better solution, which you can see in context
in v3 [1]. The only change I have from the posting there is to eliminate
the unnecssary "child->d_parent != parent" check. So the new idea is to
take a reference to the current child and then go to sleep. That alone
should be enough to prevent dentry_kill() from removing the child from
its parent's list. However, it wouldn't prevent d_move() from moving it
to a new list, nor would it prevent it from moving along the list if it
were a cursor. However, in this situation, we also hold the parent
i_rwsem in write mode, which means that no concurrent rename or readdir
can happen. You can see my full analysis here [2]. So here's the new
approach, if you can call out anything I've missod or confirm that it's
sound, I'd really appreciate it!


/* Must be called with inode->i_rwsem in held in write mode */
static bool __fsnotify_update_children_dentry_flags(struct inode *inode)
{
struct dentry *child, *alias, *last_ref = NULL;
alias = d_find_any_alias(inode);

/*
* These lists can get very long, so we may need to sleep during
* iteration. Normally this would be impossible without a cursor,
* but since we have the inode locked exclusive, we're guaranteed
* that the directory won't be modified, so whichever dentry we
* pick to sleep on won't get moved.
*/
spin_lock(&alias->d_lock);
list_for_each_entry(child, &alias->d_subdirs, d_child) {
if (need_resched()) {
/*
* We need to hold a reference while we sleep. But when
* we wake, dput() could free the dentry, invalidating
* the list pointers. We can't look at the list pointers
* until we re-lock the parent, and we can't dput() once
* we have the parent locked. So the solution is to hold
* onto our reference and free it the *next* time we drop
* alias->d_lock: either at the end of the function, or
* at the time of the next sleep.
*/
dget(child);
spin_unlock(&alias->d_lock);
dput(last_ref);
last_ref = child;
cond_resched();
spin_lock(&alias->d_lock);
}

if (!child->d_inode)
continue;

spin_lock_nested(&child->d_lock, DENTRY_D_LOCK_NESTED);
if (watched)
child->d_flags |= DCACHE_FSNOTIFY_PARENT_WATCHED;
else
child->d_flags &= ~DCACHE_FSNOTIFY_PARENT_WATCHED;
spin_unlock(&child->d_lock);
}
spin_unlock(&alias->d_lock);
dput(last_ref);
dput(alias);
return watched;
}

Thanks,
Stephen

[1]: https://lore.kernel.org/linux-fsdevel/20221028001016.332663-4-stephen.s.brennan@xxxxxxxxxx/
[2]: https://lore.kernel.org/linux-fsdevel/874jvigye9.fsf@xxxxxxxxxx/