Re: [PATCH v2 1/4] mm: Introduce vm_uffd_ops API
From: Lorenzo Stoakes
Date:  Mon Jun 30 2025 - 06:17:33 EST
On Fri, Jun 27, 2025 at 11:46:52AM -0400, Peter Xu wrote:
> Introduce a generic userfaultfd API for vm_operations_struct, so that one
> vma, especially when as a module, can support userfaults without modifying
> the core files.  More importantly, when the module can be compiled out of
> the kernel.
I'm not sure I understand this. One VMA 'especially when as a module'?
This whole sentence is really unclear.
So it seems to me that you want to be able to allow more than just:
- anonymous memory
- shmem
- hugetlb
To support userfaultfd correct?
>
> So, instead of having core mm referencing modules that may not ever exist,
> we need to have modules opt-in on core mm hooks instead.
OK this sounds reasonable.
>
> After this API applied, if a module wants to support userfaultfd, the
> module should only need to touch its own file and properly define
> vm_uffd_ops, instead of changing anything in core mm.
Yes this also sounds reasonable, _as long as_ :) the module authors are
aware of any conditions that might exist in the VMA that might be odd or
confusing.
For instance, if locks need to be held etc.
I am generally slightly wary of us giving VMAs in hooks because people do
crazy things with them that
>
> Note that such API will not work for anonymous. Core mm will process
> anonymous memory separately for userfault operations like before.
Right.
>
> This patch only introduces the API alone so that we can start to move
> existing users over but without breaking them.
Sounds sensible.
>
> Currently the uffd_copy() API is almost designed to be the simplistic with
> minimum mm changes to move over to the API.
A good approach.
>
> Signed-off-by: Peter Xu <peterx@xxxxxxxxxx>
> ---
>  include/linux/mm.h            |  9 ++++++
>  include/linux/userfaultfd_k.h | 52 +++++++++++++++++++++++++++++++++++
>  2 files changed, 61 insertions(+)
>
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index ef40f68c1183..6a5447bd43fd 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -576,6 +576,8 @@ struct vm_fault {
>  					 */
>  };
>
> +struct vm_uffd_ops;
> +
>  /*
>   * These are the virtual MM functions - opening of an area, closing and
>   * unmapping it (needed to keep files on disk up-to-date etc), pointer
> @@ -653,6 +655,13 @@ struct vm_operations_struct {
>  	 */
>  	struct page *(*find_special_page)(struct vm_area_struct *vma,
>  					  unsigned long addr);
> +#ifdef CONFIG_USERFAULTFD
> +	/*
> +	 * Userfaultfd related ops.  Modules need to define this to support
> +	 * userfaultfd.
> +	 */
> +	const struct vm_uffd_ops *userfaultfd_ops;
> +#endif
>  };
This shouldn't go in vm_ops like this. You're basically changing a
fundamental convention here - that _ops structs define handlers, now you
can have somehow nested ones?
It seems more appropriate to have something like:
struct vm_uffd_ops *(*get_uffd_ops)(struct vm_area_struct *vma);
This then matches e.g. mempolicy's get_policy handler.
>
>  #ifdef CONFIG_NUMA_BALANCING
> diff --git a/include/linux/userfaultfd_k.h b/include/linux/userfaultfd_k.h
> index df85330bcfa6..c9a093c4502b 100644
> --- a/include/linux/userfaultfd_k.h
> +++ b/include/linux/userfaultfd_k.h
> @@ -92,6 +92,58 @@ enum mfill_atomic_mode {
>  	NR_MFILL_ATOMIC_MODES,
>  };
>
> +/* VMA userfaultfd operations */
Are we sure this should be operating at the VMA level?
I mean are the operations going to be operating on VMAs or VM fault
structs? If not, then this surely belongs to the file operations no?
> +struct vm_uffd_ops {
I'd comment on the naming, but 'vm_operations_struct' is so bad that it
seems we have no sensible convention anyway so this is fine :P
> +	/**
> +	 * @uffd_features: features supported in bitmask.
> +	 *
> +	 * When the ops is defined, the driver must set non-zero features
I don't think the 'when the ops are defined' bit is necessray here, you're
commenting on the ops here, you can safely assume they're defined.
So I'd just say 'must be a subset (or all) of: VM_UFFD_MISSING|WP|MINOR.'
> +	 * to be a subset (or all) of: VM_UFFD_MISSING|WP|MINOR.
> +	 */
> +	unsigned long uffd_features;
> +	/**
> +	 * @uffd_ioctls: ioctls supported in bitmask.
> +	 *
> +	 * Userfaultfd ioctls supported by the module.  Below will always
> +	 * be supported by default whenever a module provides vm_uffd_ops:
> +	 *
> +	 *   _UFFDIO_API, _UFFDIO_REGISTER, _UFFDIO_UNREGISTER, _UFFDIO_WAKE
> +	 *
> +	 * The module needs to provide all the rest optionally supported
> +	 * ioctls.  For example, when VM_UFFD_MISSING was supported,
> +	 * _UFFDIO_COPY must be supported as ioctl, while _UFFDIO_ZEROPAGE
> +	 * is optional.
I'm not sure why we need this field? Isn't this implied by whether handlers
are set or NULL?
> +	 */
> +	unsigned long uffd_ioctls;
> +	/**
> +	 * uffd_get_folio: Handler to resolve UFFDIO_CONTINUE request.
> +	 *
> +	 * @inode: the inode for folio lookup
> +	 * @pgoff: the pgoff of the folio
> +	 * @folio: returned folio pointer
> +	 *
> +	 * Return: zero if succeeded, negative for errors.
> +	 */
> +	int (*uffd_get_folio)(struct inode *inode, pgoff_t pgoff,
> +			      struct folio **folio);
Not sure uufd_ prefix is needed in a struct that's obviously about uffd
already? I'd strip that.
Also this name is really confusing, I think it should contain continue in
some form, such as 'handle_continue()'?
This really feels like it shouldn't be a VMA function as you're dealing
with inode, pgoff, and folio and not VMA at all?
> +	/**
> +	 * uffd_copy: Handler to resolve UFFDIO_COPY|ZEROPAGE request.
> +	 *
> +	 * @dst_pmd: target pmd to resolve page fault
> +	 * @dst_vma: target vma
> +	 * @dst_addr: target virtual address
> +	 * @src_addr: source address to copy from
> +	 * @flags: userfaultfd request flags
> +	 * @foliop: previously allocated folio
> +	 *
> +	 * Return: zero if succeeded, negative for errors.
Can you please ensure you put details as to VMA lock state here. Uffd has
some very tricky handling around stuff like this.
> +	 */
> +	int (*uffd_copy)(pmd_t *dst_pmd, struct vm_area_struct *dst_vma,
> +			 unsigned long dst_addr, unsigned long src_addr,
> +			 uffd_flags_t flags, struct folio **foliop);
Do we not need a uffd_ctx parameter here?
It seems like we're assuming a _lot_ of mm understanding in the underlying
driver here.
I'm not sure it's really normal to be handing around page table state and
folios etc. to a driver like this, this is really... worrying to me.
This feels like you're trying to put mm functionality outside of mm?
What if we change how we handle page tables in future, or the locking
changes? We might not know to go and update x, y, z driver.
This is concerning.
> +};
> +typedef struct vm_uffd_ops vm_uffd_ops;
> +
>  #define MFILL_ATOMIC_MODE_BITS (const_ilog2(NR_MFILL_ATOMIC_MODES - 1) + 1)
>  #define MFILL_ATOMIC_BIT(nr) BIT(MFILL_ATOMIC_MODE_BITS + (nr))
>  #define MFILL_ATOMIC_FLAG(nr) ((__force uffd_flags_t) MFILL_ATOMIC_BIT(nr))
> --
> 2.49.0
>