Re: [PATCH v2 1/4] mm: Introduce vm_uffd_ops API

From: Suren Baghdasaryan
Date: Tue Jul 01 2025 - 13:08:19 EST


On Mon, Jun 30, 2025 at 3:16 AM Lorenzo Stoakes
<lorenzo.stoakes@xxxxxxxxxx> wrote:
>
> On Fri, Jun 27, 2025 at 11:46:52AM -0400, Peter Xu wrote:
> > Introduce a generic userfaultfd API for vm_operations_struct, so that one
> > vma, especially when as a module, can support userfaults without modifying
> > the core files. More importantly, when the module can be compiled out of
> > the kernel.
>
> I'm not sure I understand this. One VMA 'especially when as a module'?
>
> This whole sentence is really unclear.
>
> So it seems to me that you want to be able to allow more than just:
>
> - anonymous memory
> - shmem
> - hugetlb
>
> To support userfaultfd correct?
>
> >
> > So, instead of having core mm referencing modules that may not ever exist,
> > we need to have modules opt-in on core mm hooks instead.
>
> OK this sounds reasonable.
>
> >
> > After this API applied, if a module wants to support userfaultfd, the
> > module should only need to touch its own file and properly define
> > vm_uffd_ops, instead of changing anything in core mm.
>
> Yes this also sounds reasonable, _as long as_ :) the module authors are
> aware of any conditions that might exist in the VMA that might be odd or
> confusing.
>
> For instance, if locks need to be held etc.
>
> I am generally slightly wary of us giving VMAs in hooks because people do
> crazy things with them that
>
> >
> > Note that such API will not work for anonymous. Core mm will process
> > anonymous memory separately for userfault operations like before.
>
> Right.
>
> >
> > This patch only introduces the API alone so that we can start to move
> > existing users over but without breaking them.
>
> Sounds sensible.
>
> >
> > Currently the uffd_copy() API is almost designed to be the simplistic with
> > minimum mm changes to move over to the API.
>
> A good approach.
>
> >
> > Signed-off-by: Peter Xu <peterx@xxxxxxxxxx>
> > ---
> > include/linux/mm.h | 9 ++++++
> > include/linux/userfaultfd_k.h | 52 +++++++++++++++++++++++++++++++++++
> > 2 files changed, 61 insertions(+)
> >
> > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > index ef40f68c1183..6a5447bd43fd 100644
> > --- a/include/linux/mm.h
> > +++ b/include/linux/mm.h
> > @@ -576,6 +576,8 @@ struct vm_fault {
> > */
> > };
> >
> > +struct vm_uffd_ops;
> > +
> > /*
> > * These are the virtual MM functions - opening of an area, closing and
> > * unmapping it (needed to keep files on disk up-to-date etc), pointer
> > @@ -653,6 +655,13 @@ struct vm_operations_struct {
> > */
> > struct page *(*find_special_page)(struct vm_area_struct *vma,
> > unsigned long addr);
> > +#ifdef CONFIG_USERFAULTFD
> > + /*
> > + * Userfaultfd related ops. Modules need to define this to support
> > + * userfaultfd.
> > + */
> > + const struct vm_uffd_ops *userfaultfd_ops;
> > +#endif
> > };
>
> This shouldn't go in vm_ops like this. You're basically changing a
> fundamental convention here - that _ops structs define handlers, now you
> can have somehow nested ones?
>
> It seems more appropriate to have something like:
>
> struct vm_uffd_ops *(*get_uffd_ops)(struct vm_area_struct *vma);
>
> This then matches e.g. mempolicy's get_policy handler.
>
> >
> > #ifdef CONFIG_NUMA_BALANCING
> > diff --git a/include/linux/userfaultfd_k.h b/include/linux/userfaultfd_k.h
> > index df85330bcfa6..c9a093c4502b 100644
> > --- a/include/linux/userfaultfd_k.h
> > +++ b/include/linux/userfaultfd_k.h
> > @@ -92,6 +92,58 @@ enum mfill_atomic_mode {
> > NR_MFILL_ATOMIC_MODES,
> > };
> >
> > +/* VMA userfaultfd operations */
>
> Are we sure this should be operating at the VMA level?
>
> I mean are the operations going to be operating on VMAs or VM fault
> structs? If not, then this surely belongs to the file operations no?
>
> > +struct vm_uffd_ops {
>
> I'd comment on the naming, but 'vm_operations_struct' is so bad that it
> seems we have no sensible convention anyway so this is fine :P
>
> > + /**
> > + * @uffd_features: features supported in bitmask.
> > + *
> > + * When the ops is defined, the driver must set non-zero features
>
> I don't think the 'when the ops are defined' bit is necessray here, you're
> commenting on the ops here, you can safely assume they're defined.
>
> So I'd just say 'must be a subset (or all) of: VM_UFFD_MISSING|WP|MINOR.'
>
> > + * to be a subset (or all) of: VM_UFFD_MISSING|WP|MINOR.
> > + */
> > + unsigned long uffd_features;
> > + /**
> > + * @uffd_ioctls: ioctls supported in bitmask.
> > + *
> > + * Userfaultfd ioctls supported by the module. Below will always
> > + * be supported by default whenever a module provides vm_uffd_ops:
> > + *
> > + * _UFFDIO_API, _UFFDIO_REGISTER, _UFFDIO_UNREGISTER, _UFFDIO_WAKE
> > + *
> > + * The module needs to provide all the rest optionally supported
> > + * ioctls. For example, when VM_UFFD_MISSING was supported,
> > + * _UFFDIO_COPY must be supported as ioctl, while _UFFDIO_ZEROPAGE
> > + * is optional.
>
> I'm not sure why we need this field? Isn't this implied by whether handlers
> are set or NULL?
>
> > + */
> > + unsigned long uffd_ioctls;
> > + /**
> > + * uffd_get_folio: Handler to resolve UFFDIO_CONTINUE request.
> > + *
> > + * @inode: the inode for folio lookup
> > + * @pgoff: the pgoff of the folio
> > + * @folio: returned folio pointer
> > + *
> > + * Return: zero if succeeded, negative for errors.
> > + */
> > + int (*uffd_get_folio)(struct inode *inode, pgoff_t pgoff,
> > + struct folio **folio);
>
> Not sure uufd_ prefix is needed in a struct that's obviously about uffd
> already? I'd strip that.
>
> Also this name is really confusing, I think it should contain continue in
> some form, such as 'handle_continue()'?
>
> This really feels like it shouldn't be a VMA function as you're dealing
> with inode, pgoff, and folio and not VMA at all?
>
>
> > + /**
> > + * uffd_copy: Handler to resolve UFFDIO_COPY|ZEROPAGE request.
> > + *
> > + * @dst_pmd: target pmd to resolve page fault
> > + * @dst_vma: target vma
> > + * @dst_addr: target virtual address
> > + * @src_addr: source address to copy from
> > + * @flags: userfaultfd request flags
> > + * @foliop: previously allocated folio
> > + *
> > + * Return: zero if succeeded, negative for errors.
>
> Can you please ensure you put details as to VMA lock state here. Uffd has
> some very tricky handling around stuff like this.
>
> > + */
> > + int (*uffd_copy)(pmd_t *dst_pmd, struct vm_area_struct *dst_vma,
> > + unsigned long dst_addr, unsigned long src_addr,
> > + uffd_flags_t flags, struct folio **foliop);
>
> Do we not need a uffd_ctx parameter here?
>
> It seems like we're assuming a _lot_ of mm understanding in the underlying
> driver here.
>
> I'm not sure it's really normal to be handing around page table state and
> folios etc. to a driver like this, this is really... worrying to me.
>
> This feels like you're trying to put mm functionality outside of mm?

To second that, two things stick out for me here:
1. uffd_copy and uffd_get_folio seem to be at different abstraction
levels. uffd_copy is almost the entire copy operation for VM_SHARED
VMAs while uffd_get_folio is a small part of the continue operation.
2. shmem_mfill_atomic_pte which becomes uffd_copy for shmem in the
last patch is quite a complex function which itself calls some IMO
pretty internal functions like mfill_atomic_install_pte(). Expecting
modules to implement such functionality seems like a stretch to me but
maybe this is for some specialized modules which are written by mm
experts only?

>
> What if we change how we handle page tables in future, or the locking
> changes? We might not know to go and update x, y, z driver.
>
> This is concerning.
>
> > +};
> > +typedef struct vm_uffd_ops vm_uffd_ops;
> > +
> > #define MFILL_ATOMIC_MODE_BITS (const_ilog2(NR_MFILL_ATOMIC_MODES - 1) + 1)
> > #define MFILL_ATOMIC_BIT(nr) BIT(MFILL_ATOMIC_MODE_BITS + (nr))
> > #define MFILL_ATOMIC_FLAG(nr) ((__force uffd_flags_t) MFILL_ATOMIC_BIT(nr))
> > --
> > 2.49.0
> >