Re: [PATCH v2 1/4] mm: Introduce vm_uffd_ops API

From: Peter Xu
Date: Thu Jul 03 2025 - 11:49:08 EST


On Thu, Jul 03, 2025 at 08:01:12AM -0700, Suren Baghdasaryan wrote:
> On Wed, Jul 2, 2025 at 1:23 PM Peter Xu <peterx@xxxxxxxxxx> wrote:
> >
> > On Wed, Jul 02, 2025 at 09:16:31PM +0300, Mike Rapoport wrote:
> > > On Tue, Jul 01, 2025 at 10:04:28AM -0700, Suren Baghdasaryan wrote:
> > > > On Mon, Jun 30, 2025 at 3:16 AM Lorenzo Stoakes
> > > > <lorenzo.stoakes@xxxxxxxxxx> wrote:
> > > > >
> > > > > It seems like we're assuming a _lot_ of mm understanding in the underlying
> > > > > driver here.
> > > > >
> > > > > I'm not sure it's really normal to be handing around page table state and
> > > > > folios etc. to a driver like this; this is really... worrying to me.
> > > > >
> > > > > This feels like you're trying to put mm functionality outside of mm?
> > > >
> > > > To second that, two things stick out for me here:
> > > > 1. uffd_copy and uffd_get_folio seem to be at different abstraction
> > > > levels. uffd_copy is almost the entire copy operation for VM_SHARED
> > > > VMAs while uffd_get_folio is a small part of the continue operation.
> > > > 2. shmem_mfill_atomic_pte(), which becomes uffd_copy for shmem in the
> > > > last patch, is quite a complex function which itself calls some IMO
> > > > pretty internal functions like mfill_atomic_install_pte(). Expecting
> > > > modules to implement such functionality seems like a stretch to me, but
> > > > maybe this is for some specialized modules which are written by mm
> > > > experts only?
> > >
> > > Largely, shmem_mfill_atomic_pte() differs from the anonymous memory
> > > version (mfill_atomic_pte_copy()) in the way the allocated folio is
> > > accounted and whether it's added to the page cache. So instead of
> > > uffd_copy(...) we might add:
> > >
> > > int (*folio_alloc)(struct vm_area_struct *vma, unsigned long dst_addr);
> > > void (*folio_release)(struct vm_area_struct *vma, struct folio *folio);
> >
> > Thanks for digging into this, Mike. It's just that IMHO it may not be
> > enough..
> >
> > I actually tried to think about a more complicated API, but the more I
> > thought about it, the more it looked like overkill. Let me list a few
> > things here to show why I chose the simplest solution with uffd_copy()
> > for now.
>
> TBH the below does not sound like overkill to me if it keeps mm
> internals contained without over-exposing them to modules.
>
> >
> > Firstly, see shmem_inode_acct_blocks() at the entry point: that's shmem
> > accounting we need to do before any allocation, and whether or not we
> > allocate at all.
>
> Ok, this results in an additional folio_prealloc() hook.
>
> > That accounting can't be put into folio_alloc() even if we had one,
> > because the folio may already be allocated when we reach here (that is,
> > when *foliop != NULL). It was a very deliberate decision of ours in 2016
> > to do shmem accounting separately:
> >
> > https://lore.kernel.org/all/20161216144821.5183-37-aarcange@xxxxxxxxxx/
> >
> > Then, there's also the complexity of how the page cache should be managed
> > for each mem type. For shmem, the folio is only injected right before the
> > pgtable installation. We can't inject it in folio_alloc(), because then
> > others could fault it in before the data is populated. It means we need
> > at least one more API to do the page cache injection for the folio that
> > was just allocated by folio_alloc().
>
> folio_add_to_cache() hook?
>
> >
> > We may also want different treatment of how the folio flags are set up.
> > It may not always happen in folio_alloc(). E.g. for shmem right now we
> > do this right before the page cache injection:
> >
> > VM_BUG_ON(folio_test_locked(folio));
> > VM_BUG_ON(folio_test_swapbacked(folio));
> > __folio_set_locked(folio);
> > __folio_set_swapbacked(folio);
> > __folio_mark_uptodate(folio);
> >
> > We may not want to do exactly the same for all the other mem types. E.g.
> > we likely don't want to set swapbacked for guest-memfd folios. We may
> > need yet another API for that.
>
> Can we do that inside the folio_add_to_cache() hook before doing the injection?

The folio can be freed outside this function by the userfaultfd callers, so
I wonder if it'll crash the kernel when mm sees a folio that is still locked
while it's being freed.
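
To make that concrete, a rough sketch of what I mean (hook name
hypothetical, from the proposal above):

	/* inside the hypothetical ->folio_add_to_cache(), as suggested: */
	__folio_set_locked(folio);
	__folio_set_swapbacked(folio);
	ret = shmem_add_to_page_cache(folio, ...);	/* may still fail */

	/*
	 * Back in the uffd caller, on this or any later failure the
	 * folio gets dropped while still PG_locked.  PG_locked is part
	 * of PAGE_FLAGS_CHECK_AT_FREE, so freeing the last reference
	 * would trigger a "Bad page state" report from bad_page().
	 */
	folio_put(folio);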

>
> >
> > Then, thinking about the failure path, given the shmem accounting
> > question above, we may want yet another post-failure hook that does
> > shmem_inode_unacct_blocks() properly for shmem. Maybe we don't need that
> > for guest-memfd, but we still need it for shmem to properly unaccount
> > when the folio allocation succeeded but copy_from_user() failed somehow.
>
> Failure handling hook.
>
> >
> > Then the question is: do we really need all this fuss, almost for nothing?
>
> If that helps to keep modules simpler and mm details contained inside
> the core kernel, I think it is worth doing. I imagine if shmem were
> a module, then implementing the current API would require exporting
> functions like mfill_atomic_install_pte(). That seems like
> over-exposure to me. And if we can avoid all the locking and
> refcounting that Liam mentioned, that would definitely be worth it
> IMHO.
>
> >
> > Note that, IIUC, we can still change this in the future if we want. The
> > initial isolation in this series would still be good to land earlier; we
> > don't even plan to support MISSING for guest-memfd in the near future,
> > only MINOR mode for now. We don't necessarily need to work out MISSING
> > immediately to keep useful features landing in Linux. Even when we do
> > have it for guest-memfd, it shouldn't be a challenge to maintain both.
> > This is just not a contract we need to sign forever yet.

[1]

> >
> > Hope the above clarifies a bit why I chose the simplest solution for
> > now. I don't like this API either, as I mentioned in the cover letter.
> > It's really a trade-off I made; at least for now it's the best I can
> > come up with.
> >
> > That said, if any of us has a better solution, please shoot. I'm always
> > open to better alternatives.
>
> I didn't explore this code as deeply as you have and therefore
> might not see all the pitfalls, but it looks like you already considered
> an alternative which does sound better to me. What are the drawbacks
> that I might be missing?

It'll be a complex API from another angle: we'll need 5-6 hooks just to
implement the uffd MISSING mode.
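
For illustration, a rough sketch of the decomposed API we've been
discussing, collecting the hook names proposed in this thread (the names
and signatures are hypothetical, not what the posted series has):

	struct vm_uffd_ops {
		/* pre-allocation accounting, a la shmem_inode_acct_blocks() */
		int (*folio_prealloc)(struct vm_area_struct *vma,
				      unsigned long dst_addr);
		/* allocate the destination folio (Mike's signature above) */
		int (*folio_alloc)(struct vm_area_struct *vma,
				   unsigned long dst_addr);
		/* set the folio flags and inject it into the page cache */
		int (*folio_add_to_cache)(struct vm_area_struct *vma,
					  struct folio *folio,
					  unsigned long dst_addr);
		/* undo accounting when e.g. copy_from_user() failed */
		void (*uffd_failure)(struct vm_area_struct *vma,
				     struct folio *folio);
		/* release a folio that never got installed */
		void (*folio_release)(struct vm_area_struct *vma,
				      struct folio *folio);
		/* MINOR mode: look up the existing folio */
		int (*uffd_get_folio)(struct inode *inode, pgoff_t pgoff,
				      struct folio **foliop);
	};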

Meanwhile, it also already starts to introduce several indirect function
jumps into shmem, even though one would be enough to decouple it.

As I mentioned above [1], I'm not against doing it, but IMHO it does not
need to be done right now: currently it looks to me like guest-memfd may
not really need MISSING mode traps, and I am actually not sure whether it
ever will. We may not want to introduce 5-6 APIs with no further user. We
can always discuss a proper API when the demand arrives.

If uffd_copy() is so concerning to people, there's also another
alternative: make vm_uffd_ops support all the traps except MISSING.
uffd_copy() is only about MISSING traps, so we could drop the uffd_copy()
API, but then the API is incomplete on its own.
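
Something like this, again only a sketch (assuming uffd_get_folio keeps
whatever signature the series gives it):

	/*
	 * vm_uffd_ops trimmed down to everything but MISSING: MINOR
	 * mode (UFFDIO_CONTINUE) only needs the folio lookup, so
	 * uffd_copy() goes away and MISSING simply stays unsupported
	 * for vm_uffd_ops users.
	 */
	struct vm_uffd_ops {
		int (*uffd_get_folio)(struct inode *inode, pgoff_t pgoff,
				      struct folio **foliop);
	};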

I still feel like we're overly worried about this. For out-of-tree drivers
I never cared about breaking anything. For in-tree ones, this discussion
can happen when there's a need for it, and at that point we can also have
a look at the implementation using uffd_copy(); maybe it won't be that bad
either. It seems to me we don't necessarily need to figure this out right
now. IMHO we can do it in two steps in the worst case.

Thanks,

--
Peter Xu