Re: [RFC bpf-next 1/1] bpf: Introduce iter_pagecache

From: Daniel Xu
Date: Thu Apr 08 2021 - 16:44:20 EST


On Thu, Apr 08, 2021 at 10:19:35AM +0200, Christian Brauner wrote:
> On Wed, Apr 07, 2021 at 02:46:11PM -0700, Daniel Xu wrote:
> > This commit introduces the bpf page cache iterator. This iterator allows
> > users to run a bpf prog against each page in the "page cache".
> > Internally, the "page cache" is extremely tied to VFS superblock + inode
> > combo. Because of this, iter_pagecache will only examine pages in the
> > caller's mount namespace.
> >
> > Signed-off-by: Daniel Xu <dxu@xxxxxxxxx>
> > ---
> > kernel/bpf/Makefile | 2 +-
> > kernel/bpf/pagecache_iter.c | 293 ++++++++++++++++++++++++++++++++++++
> > 2 files changed, 294 insertions(+), 1 deletion(-)
> > create mode 100644 kernel/bpf/pagecache_iter.c

<...>

> >
> > +static int init_seq_pagecache(void *priv_data, struct bpf_iter_aux_info *aux)
> > +{
> > + struct bpf_iter_seq_pagecache_info *info = priv_data;
> > + struct radix_tree_iter iter;
> > + struct super_block *sb;
> > + struct mount *mnt;
> > + void **slot;
> > + int err;
> > +
> > + info->ns = current->nsproxy->mnt_ns;
> > + get_mnt_ns(info->ns);
> > + INIT_RADIX_TREE(&info->superblocks, GFP_KERNEL);
> > +
> > + spin_lock(&info->ns->ns_lock);
> > + list_for_each_entry(mnt, &info->ns->list, mnt_list) {
>
> Not just are there helpers for taking ns_lock
> static inline void lock_ns_list(struct mnt_namespace *ns)
> static inline void unlock_ns_list(struct mnt_namespace *ns)
> they are private to fs/namespace.c because it's the only place that
> should ever walk this list.

Thanks for the hints. Would it be acceptable to add some helpers to
fs/namespace.c to allow walking the list?

IIUC the only way to find a list of mounts is by looking at the mount
namespace. And walking each mount and looking at each `struct
super_node`'s inode's `struct address_space` seemed like the cleanest
way to walkthe page cache.

> This seems buggy: why is it ok here to only take ns_lock and not also
> namespace_sem like mnt_already_visible() and __is_local_mountpoint()
> or the relevant proc iterators? I might be missing something.

Thanks for the hints. I'll take a closer look at the locking. Most
probably I didn't get it right.

I should have also mentioned in the cover letter that I'm fairly sure I
messed up the locking somewhere.

>
> > + sb = mnt->mnt.mnt_sb;
> > +
> > + /* The same mount may be mounted in multiple places */
> > + if (radix_tree_lookup(&info->superblocks, (unsigned long)sb))
> > + continue;
> > +
> > + err = radix_tree_insert(&info->superblocks,
> > + (unsigned long)sb, (void *)1);
> > + if (err)
> > + goto out;
> > + }
> > +
> > + radix_tree_for_each_slot(slot, &info->superblocks, &iter, 0) {
> > + sb = (struct super_block *)iter.index;
> > + atomic_inc(&sb->s_active);
>
> It also isn't nice that you mess with sb->s_active directly.
>
> Imho, this is poking around in a lot of fs/ specific stuff that other
> parts of the kernel should not care about or have access to.

Re above: do you think it'd be appropriate to add more helpers to fs/ ?

<...>

Thanks,
Daniel