Re: [PATCH v3 1/6] mm: userfaultfd: generic continue for non hugetlbfs
From: Nikita Kalyazin
Date: Fri Jun 20 2025 - 08:00:51 EST
On 11/06/2025 13:56, Peter Xu wrote:
On Wed, Jun 11, 2025 at 01:09:32PM +0100, Nikita Kalyazin wrote:
On 10/06/2025 23:22, Peter Xu wrote:
On Fri, Apr 04, 2025 at 03:43:47PM +0000, Nikita Kalyazin wrote:
Remove shmem-specific code from UFFDIO_CONTINUE implementation for
non-huge pages by calling vm_ops->fault(). A new VMF flag,
FAULT_FLAG_USERFAULT_CONTINUE, is introduced to avoid recursive call to
handle_userfault().
It's not yet clear why this needs to be generalized out of the blue.
Some mention of the guest_memfd use case might help other reviewers, as
would some mention of the need to introduce userfaultfd support in
kernel modules.
Hi Peter,
Sounds fair, thank you.
Suggested-by: James Houghton <jthoughton@xxxxxxxxxx>
Signed-off-by: Nikita Kalyazin <kalyazin@xxxxxxxxxx>
---
include/linux/mm_types.h | 4 ++++
mm/hugetlb.c | 2 +-
mm/shmem.c | 9 ++++++---
mm/userfaultfd.c | 37 +++++++++++++++++++++++++++----------
4 files changed, 38 insertions(+), 14 deletions(-)
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 0234f14f2aa6..2f26ee9742bf 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -1429,6 +1429,9 @@ enum tlb_flush_reason {
* @FAULT_FLAG_ORIG_PTE_VALID: whether the fault has vmf->orig_pte cached.
* We should only access orig_pte if this flag set.
* @FAULT_FLAG_VMA_LOCK: The fault is handled under VMA lock.
+ * @FAULT_FLAG_USERFAULT_CONTINUE: The fault handler must not call userfaultfd
+ * minor handler as it is being called by the
+ * userfaultfd code itself.
We probably shouldn't leak the "CONTINUE" concept to mm core if possible,
as it's not easy to follow without the userfault minor context. It might
be better to use a generic term like NO_USERFAULT.
Yes, I agree, can name it more generically.
That said, since we'll need to add a vm_ops hook anyway in a later patch,
I wonder whether we can also avoid reusing fault() and instead resolve the
page faults through that vm_ops hook too. That might be helpful because then
we can avoid this new FAULT_FLAG_* that is of no use to non-userfault
users, and we also wouldn't need to hand-cook the vm_fault struct below
just to suit the current fault() interface.
I'm not sure I fully understand that. Calling the fault() op lets us reuse the
FS specifics when resolving the fault. I get that the new op can imply the
userfault flag so the flag doesn't need to be exposed to mm, but doing so
would duplicate logic within FSes between this new op and fault(), unless
we factor the common parts out. For example, for shmem_get_folio_gfp(), we
would still need a way to suppress the call to handle_userfault() when
shmem_get_folio_gfp() is called from the new op. Is that what you're
proposing?
Yes, that is what I was proposing. shmem_get_folio_gfp() already has that
handling: when vmf==NULL, vma is also NULL, and userfault is skipped.
So what I was thinking is one vm_ops.userfaultfd_request(req), where req
can be:
(1) UFFD_REQ_GET_SUPPORTED: for existing RAM-FSes this should return
all of MISSING/WP/MINOR. Here WP means sync wp tracking; async has
so far been supported by default almost everywhere except
VM_DROPPABLE. For guest-memfd in the future, we can return MINOR only
for now (even though I think it shouldn't be hard to support the
other two..).
(2) UFFD_REQ_FAULT_RESOLVE: this should play the fault() role but be
well defined to suit userfault's needs for fault resolution. It likely
doesn't need vmf as a parameter, but rather (leaving anon aside, since
anon has vm_ops==NULL..) the inode and offsets; perhaps some flag
would be needed to distinguish MISSING from MINOR faults, for
example.
Maybe some more.
I was even thinking about whether we could bring hugetlb into the picture too
and generalize its fault resolution. Hugetlb has always been special; maybe
this is a chance to generalize it as well, but it doesn't need to happen in
one shot even if it could work. We could start with shmem.
So this does sound slightly involved, and I'm not yet 100% sure it
will work, but it likely will. If you want, I can take a stab at it this week
or next just to see whether it works in general. I also don't expect this
to depend on guest-memfd at all - it can stand alone as a refactoring
making userfault module-ready.
Thanks for explaining that. I played a bit with it myself and it
appears to be working for the MISSING mode for both shmem and
guest_memfd. Attaching my sketch below. Please let me know if this is
how you see it.
I found that the arguments and return values differ significantly
between the two request types, which may be a bit confusing, although we
do not expect many callers of those.
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 8483e09aeb2c..eb30b23b24d3 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -603,6 +603,16 @@ struct vm_fault {
*/
};
+#ifdef CONFIG_USERFAULTFD
+/*
+ * Used by userfaultfd_request().
+ */
+enum uffd_req {
+ UFFD_REQ_GET_SUPPORTED, /* query supported userfaultfd modes */
+ UFFD_REQ_FAULT_RESOLVE, /* request fault resolution */
+};
+#endif
+
/*
* These are the virtual MM functions - opening of an area, closing and
* unmapping it (needed to keep files on disk up-to-date etc), pointer
@@ -680,6 +690,22 @@ struct vm_operations_struct {
*/
struct page *(*find_special_page)(struct vm_area_struct *vma,
unsigned long addr);
+
+#ifdef CONFIG_USERFAULTFD
+ /*
+ * Called by the userfaultfd code to query supported modes or request
+ * fault resolution.
+ * If called with req UFFD_REQ_GET_SUPPORTED, it returns a bitmask
+ * of modes as in struct uffdio_register. No other arguments are
+ * used.
+ * If called with req UFFD_REQ_FAULT_RESOLVE, it resolves the fault
+ * using the mode specified in the mode argument. The inode, pgoff and
+ * foliop arguments must be set accordingly.
+ */
+ int (*userfaultfd_request)(enum uffd_req req, int mode,
+ struct inode *inode, pgoff_t pgoff,
+ struct folio **foliop);
+#endif
};
#ifdef CONFIG_NUMA_BALANCING
diff --git a/include/linux/userfaultfd_k.h b/include/linux/userfaultfd_k.h
index 75342022d144..1cabb925da0e 100644
--- a/include/linux/userfaultfd_k.h
+++ b/include/linux/userfaultfd_k.h
@@ -222,7 +222,11 @@ static inline bool vma_can_userfault(struct vm_area_struct *vma,
return false;
if ((vm_flags & VM_UFFD_MINOR) &&
- (!is_vm_hugetlb_page(vma) && !vma_is_shmem(vma)))
+ (!is_vm_hugetlb_page(vma) &&
+ (!vma->vm_ops || !vma->vm_ops->userfaultfd_request ||
+ !(vma->vm_ops->userfaultfd_request(UFFD_REQ_GET_SUPPORTED, 0,
+ NULL, 0, NULL) &
+ UFFDIO_REGISTER_MODE_MINOR))))
return false;
/*
@@ -243,8 +247,11 @@ static inline bool vma_can_userfault(struct vm_area_struct *vma,
#endif
/* By default, allow any of anon|shmem|hugetlb */
- return vma_is_anonymous(vma) || is_vm_hugetlb_page(vma) ||
- vma_is_shmem(vma);
+ return vma_is_anonymous(vma) ||
+ is_vm_hugetlb_page(vma) ||
+ (vma->vm_ops->userfaultfd_request &&
+ vma->vm_ops->userfaultfd_request(UFFD_REQ_GET_SUPPORTED, 0, NULL,
+ 0, NULL));
}
static inline bool vma_has_uffd_without_event_remap(struct vm_area_struct *vma)
diff --git a/mm/shmem.c b/mm/shmem.c
index 1ede0800e846..a5b5c4131dcf 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -5203,6 +5203,40 @@ static int shmem_error_remove_folio(struct address_space *mapping,
return 0;
}
+#ifdef CONFIG_USERFAULTFD
+static int shmem_userfaultfd_request(enum uffd_req req, int mode,
+ struct inode *inode, pgoff_t pgoff,
+ struct folio **foliop)
+{
+ int ret;
+
+ switch (req) {
+ case UFFD_REQ_GET_SUPPORTED:
+ ret = UFFDIO_REGISTER_MODE_MISSING |
+       UFFDIO_REGISTER_MODE_WP |
+       UFFDIO_REGISTER_MODE_MINOR;
+ break;
+ case UFFD_REQ_FAULT_RESOLVE:
+ ret = shmem_get_folio(inode, pgoff, 0, foliop, SGP_NOALLOC);
+ if (ret == -ENOENT)
+ ret = -EFAULT;
+ if (ret)
+ break;
+ if (!*foliop) {
+ ret = -EFAULT;
+ break;
+ }
+ break;
+ default:
+ ret = -EINVAL;
+ break;
+ }
+
+ return ret;
+}
+#endif
+
static const struct address_space_operations shmem_aops = {
.writepage = shmem_writepage,
.dirty_folio = noop_dirty_folio,
@@ -5306,6 +5340,9 @@ static const struct vm_operations_struct shmem_vm_ops = {
.set_policy = shmem_set_policy,
.get_policy = shmem_get_policy,
#endif
+#ifdef CONFIG_USERFAULTFD
+ .userfaultfd_request = shmem_userfaultfd_request,
+#endif
};
static const struct vm_operations_struct shmem_anon_vm_ops = {
@@ -5315,6 +5352,9 @@ static const struct vm_operations_struct shmem_anon_vm_ops = {
.set_policy = shmem_set_policy,
.get_policy = shmem_get_policy,
#endif
+#ifdef CONFIG_USERFAULTFD
+ .userfaultfd_request = shmem_userfaultfd_request,
+#endif
};
int shmem_init_fs_context(struct fs_context *fc)
diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index d06453fa8aba..efc150bf5691 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -392,16 +392,18 @@ static int mfill_atomic_pte_continue(pmd_t *dst_pmd,
struct page *page;
int ret;
- ret = shmem_get_folio(inode, pgoff, 0, &folio, SGP_NOALLOC);
- /* Our caller expects us to return -EFAULT if we failed to find folio */
- if (ret == -ENOENT)
- ret = -EFAULT;
+ if (!dst_vma->vm_ops->userfaultfd_request ||
+ !(dst_vma->vm_ops->userfaultfd_request(UFFD_REQ_GET_SUPPORTED, 0,
+ NULL, 0, NULL) &
+ UFFDIO_REGISTER_MODE_MINOR)) {
+ return -EFAULT;
+ }
+
+ ret = dst_vma->vm_ops->userfaultfd_request(UFFD_REQ_FAULT_RESOLVE,
+ UFFDIO_REGISTER_MODE_MINOR,
+ inode, pgoff, &folio);
if (ret)
goto out;
- if (!folio) {
- ret = -EFAULT;
- goto out;
- }
page = folio_file_page(folio, pgoff);
if (PageHWPoison(page)) {
@@ -770,10 +772,10 @@ static __always_inline ssize_t mfill_atomic(struct userfaultfd_ctx *ctx,
return mfill_atomic_hugetlb(ctx, dst_vma, dst_start,
src_start, len, flags);
- if (!vma_is_anonymous(dst_vma) && !vma_is_shmem(dst_vma))
- goto out_unlock;
- if (!vma_is_shmem(dst_vma) &&
- uffd_flags_mode_is(flags, MFILL_ATOMIC_CONTINUE))
+ if (!vma_is_anonymous(dst_vma) &&
+ (!dst_vma->vm_ops->userfaultfd_request ||
+ (!dst_vma->vm_ops->userfaultfd_request(UFFD_REQ_GET_SUPPORTED, 0,
+ NULL, 0, NULL))))
goto out_unlock;
while (src_addr < src_start + len) {
Thanks,
--
Peter Xu