Re: fsnotify_mark_srcu wtf?

From: Jan Kara
Date: Sat Nov 05 2016 - 17:44:52 EST


On Thu 03-11-16 12:25:13, Amir Goldstein wrote:
> On Thu, Nov 3, 2016 at 12:09 AM, Miklos Szeredi <miklos@xxxxxxxxxx> wrote:
> > We've got a report where a fanotify daemon that implements permission checks
> > screws up and doesn't send a reply. This then causes widespread hangs due to
> > fsnotify_mark_srcu read side lock being held and thus causing synchronize_srcu()
> > called from e.g. inotify_release()-> fsnotify_destroy_group()->
> > fsnotify_mark_destroy_list() to block.
> >
> > Below program demonstrates the issue. It should output a single line:
> >
> > close(inotify_fd): success
> >
> > Instead it outputs nothing, which means that close(inotify_fd) got blocked by
> > the waiting permission event.
> >
> > Wouldn't making the srcu per-group fix this? Would that be too expensive?
> >
>
> IIUC, the purpose of fsnotify_mark_srcu is to protect the inode and vfsmount
> mark lists, which are used to iterate the groups to send events to.
> So you cannot make it per group as far as I can tell.
>
> OTOH, specifically, for fanotify, once you can passed
> fanotify_should_send_event(),
> there is no need to keep a reference to the mark, so it may be safely destroyed,
> you only need a reference to the group.
>
> I tested this patch on top of my fanotify super block watch development branch
> and it seems to fix the problem you reported, but I am not savvy enough with
> srcu to say that this is correct.
> Jan, what do you think?

So the way you've written it is definitely buggy. The inode_node and
vfsmount_node pointers may become invalid once you drop SRCU lock and so
you cannot dereference them even after regaining that lock. Also frankly
your solution looks a bit ugly. I'll think whether we can somehow fix the
problem...

Honza

>
> From 28da34cdf9bf71fe9bbac13ded11a19da3b7a48e Mon Sep 17 00:00:00 2001
> From: Amir Goldstein <amir73il@xxxxxxxxx>
> Date: Thu, 3 Nov 2016 11:55:46 +0200
> Subject: [PATCH] fanotify: handle permission events without holding
> fsnotify_mark_srcu
>
> Handling fanotify events does not entail dereferencing fsnotify_mask
> beyond the point of fanotify_should_send_event().
>
> For the case of permission events, which may block indefinetely,
> return -EAGAIN and then fsnotify() will call handle_event() again
> without a reference to the mark.
>
> Without a reference to the mark, there is no need to call
> handle_event() under fsnotify_mark_srcu read side lock
> which is protecting the inode/vfsmount mark lists.
> We just need to hold a reference to the group
>
> Signed-off-by: Amir Goldstein <amir73il@xxxxxxxxx>
> ---
> fs/notify/fanotify/fanotify.c | 14 +++++++++++---
> fs/notify/fsnotify.c | 26 ++++++++++++++++++++++----
> 2 files changed, 33 insertions(+), 7 deletions(-)
>
> diff --git a/fs/notify/fanotify/fanotify.c b/fs/notify/fanotify/fanotify.c
> index 604e307..0ffa9ed 100644
> --- a/fs/notify/fanotify/fanotify.c
> +++ b/fs/notify/fanotify/fanotify.c
> @@ -298,9 +298,17 @@ static int fanotify_handle_event(struct
> fsnotify_group *group,
> BUILD_BUG_ON(FAN_ACCESS_PERM != FS_ACCESS_PERM);
> BUILD_BUG_ON(FAN_ONDIR != FS_ISDIR);
>
> - if (!fanotify_should_send_event(inode_mark, vfsmnt_mark, mask, data
> - data_type))
> - return 0;
> + if (inode_mark || vfsmnt_mark) {
> + if (!fanotify_should_send_event(inode_mark, vfsmnt_mark, mask,
> + data, data_type))
> + return 0;
> +
> +#ifdef CONFIG_FANOTIFY_ACCESS_PERMISSIONS
> + /* Ask to be called again without a reference to mark */
> + if (mask & FAN_ALL_PERM_EVENTS)
> + return -EAGAIN;
> +#endif
> + }
>
> pr_debug("%s: group=%p inode=%p mask=%x\n", __func__, group, inode,
> mask);
> diff --git a/fs/notify/fsnotify.c b/fs/notify/fsnotify.c
> index 34bccbe..dc0c86e 100644
> --- a/fs/notify/fsnotify.c
> +++ b/fs/notify/fsnotify.c
> @@ -199,7 +199,7 @@ int fsnotify(struct inode *to_tell, __u32 mask,
> void *data, int data_is,
> {
> struct hlist_node *inode_node = NULL, *vfsmount_node = NULL;
> struct fsnotify_mark *inode_mark = NULL, *vfsmount_mark = NULL;
> - struct fsnotify_group *inode_group, *vfsmount_group;
> + struct fsnotify_group *inode_group, *vfsmount_group, *group;
> struct mount *mnt;
> int idx, ret = 0;
> /* global tests shouldn't care about events on child only the
> specific event */
> @@ -282,15 +282,33 @@ int fsnotify(struct inode *to_tell, __u32 mask,
> void *data, int data_is,
> ret = send_to_group(to_tell, inode_mark, vfsmount_mark, mask,
> data, data_is, cookie, file_name);
>
> - if (ret && (mask & ALL_FSNOTIFY_PERM_EVENTS))
> - goto out;
> -
> if (inode_group)
> inode_node = srcu_dereference(inode_node->next,
> &fsnotify_mark_srcu);
> if (vfsmount_group)
> vfsmount_node = srcu_dereference(vfsmount_node->next,
> &fsnotify_mark_srcu);
> +
> + if (ret != -EAGAIN)
> + continue;
> +
> + group = inode_group ? : vfsmount_group;
> + fsnotify_get_group(group);
> + srcu_read_unlock(&fsnotify_mark_srcu, idx);
> + /*
> + * Calling handle_event() again without reference to mark,
> + * so we do not need to call it under fsnotify_mark_srcu
> + * which is protecting the inode/vfsmount mark lists.
> + * We just need to hold a reference to the group
> + */
> + return group->ops->handle_event(group, to_tell, NULL, NULL,
> + mask, data, data_is,
> + file_name, cookie);
> + idx = srcu_read_lock(&fsnotify_mark_srcu);
> + fsnotify_put_group(group);
> +
> + if (ret && (mask & ALL_FSNOTIFY_PERM_EVENTS))
> + goto out;
> }
> ret = 0;
> out:
> --
> 2.7.4
--
Jan Kara <jack@xxxxxxxx>
SUSE Labs, CR