Re: [PATCH v2 1/4] mm: Introduce vm_uffd_ops API

From: Peter Xu
Date: Wed Jul 02 2025 - 16:23:17 EST


On Wed, Jul 02, 2025 at 09:16:31PM +0300, Mike Rapoport wrote:
> On Tue, Jul 01, 2025 at 10:04:28AM -0700, Suren Baghdasaryan wrote:
> > On Mon, Jun 30, 2025 at 3:16 AM Lorenzo Stoakes
> > <lorenzo.stoakes@xxxxxxxxxx> wrote:
> > >
> > > It seems like we're assuming a _lot_ of mm understanding in the underlying
> > > driver here.
> > >
> > > I'm not sure it's really normal to be handing around page table state and
> > > folios etc. to a driver like this, this is really... worrying to me.
> > >
> > > This feels like you're trying to put mm functionality outside of mm?
> >
> > To second that, two things stick out for me here:
> > 1. uffd_copy and uffd_get_folio seem to be at different abstraction
> > levels. uffd_copy is almost the entire copy operation for VM_SHARED
> > VMAs while uffd_get_folio is a small part of the continue operation.
> > 2. shmem_mfill_atomic_pte which becomes uffd_copy for shmem in the
> > last patch is quite a complex function which itself calls some IMO
> > pretty internal functions like mfill_atomic_install_pte(). Expecting
> > modules to implement such functionality seems like a stretch to me but
> > maybe this is for some specialized modules which are written by mm
> > experts only?
>
> Largely shmem_mfill_atomic_pte() differs from anonymous memory version
> (mfill_atomic_pte_copy()) by the way the allocated folio is accounted and
> whether it's added to the page cache. So instead of uffd_copy(...) we might
> add
>
> int (*folio_alloc)(struct vm_area_struct *vma, unsigned long dst_addr);
> void (*folio_release)(struct vm_area_struct *vma, struct folio *folio);

Thanks for digging into this, Mike. It's just that IMHO it may not be
enough..

I actually tried to think about a more complicated API, but the more I
thought about it, the more it looked like overkill. Let me list a few
things here to show why I chose the simplest solution with uffd_copy() for
now.
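
(For context: in this series uffd_copy() basically takes over the shape of
shmem_mfill_atomic_pte(). The sketch below is only a rough approximation
from memory; the last patch has the real definition:)

struct vm_uffd_ops {
	/* ... other ops ... */
	int (*uffd_copy)(pmd_t *dst_pmd, struct vm_area_struct *dst_vma,
			 unsigned long dst_addr, unsigned long src_addr,
			 uffd_flags_t flags, struct folio **foliop);
};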

Firstly, see shmem_inode_acct_blocks() at the entrance: that's the shmem
accounting we need to do before any allocation, and regardless of whether
an allocation happens at all. That accounting can't be moved into
folio_alloc() even if we had one, because the folio may already be
allocated by the time we get here (that is, when *foliop != NULL). Doing
the shmem accounting separately was a deliberate decision on our part back
in 2016:

https://lore.kernel.org/all/20161216144821.5183-37-aarcange@xxxxxxxxxx/
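
(Roughly, the ordering looks like below; a simplified sketch from memory,
not the verbatim code:)

	/* Simplified sketch of the shmem_mfill_atomic_pte() ordering */
	if (shmem_inode_acct_blocks(inode, 1))	/* accounting happens first, always */
		return -ENOMEM;
	if (!*foliop) {
		folio = shmem_alloc_folio(...);	/* fresh allocation */
		/* ... copy the user data into the new folio ... */
	} else {
		folio = *foliop;		/* already allocated on a previous attempt */
	}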

Then there's also the complexity of how the page cache should be managed
for each mem type. For shmem, the folio is only injected into the page
cache right before the pgtable installation. We can't inject it in
folio_alloc(), because then others could fault the page in before its data
is populated. It means we'd need at least one more API to do the page
cache injection for the folio that was just allocated from folio_alloc().
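
Something like (name made up, just to illustrate where it would sit):

	/*
	 * Hypothetical: inject the folio into the page cache only after the
	 * data is copied, right before installing the pgtable entry.
	 */
	int (*uffd_folio_add_cache)(struct vm_area_struct *dst_vma,
				    struct folio *folio, pgoff_t pgoff);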

We may also want different treatment of how the folio flags are set up,
and that may not always belong in folio_alloc(). E.g. for shmem right now
we do this right before the page cache injection:

VM_BUG_ON(folio_test_locked(folio));
VM_BUG_ON(folio_test_swapbacked(folio));
__folio_set_locked(folio);
__folio_set_swapbacked(folio);
__folio_mark_uptodate(folio);

We may not want to do exactly the same for all the other mem types. E.g.
we likely don't want to set swapbacked on guest-memfd folios. We may need
one more API for that.
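
Maybe something like (hypothetical again):

	/*
	 * Hypothetical: per-type folio flag setup before page cache
	 * injection.  shmem would set swapbacked here; guest-memfd likely
	 * would not.
	 */
	void (*uffd_folio_prepare)(struct folio *folio);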

Then, thinking about the failure path and the shmem accounting question
above, we may want yet another post-failure hook that does
shmem_inode_unacct_blocks() properly for shmem. Maybe we don't need that
for guest-memfd, but we do need it for shmem to properly unaccount when
the folio allocation succeeded but copy_from_user() failed somehow.
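
E.g. (hypothetical, just for the shape):

	/*
	 * Hypothetical: undo the early accounting when the copy fails after
	 * the folio was allocated.  shmem would do
	 * shmem_inode_unacct_blocks(); guest-memfd may be a no-op.
	 */
	void (*uffd_copy_failed)(struct vm_area_struct *dst_vma,
				 struct folio *folio);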

Then the question is, do we really need all this fuss for almost nothing?

Note that if we want, IIUC we can still change this in the future. The
initial isolation in this series would still be good to land early; we
don't even plan to support MISSING for guest-memfd in the near future,
only MINOR mode for now. We don't necessarily need to work out MISSING
immediately to land useful features in Linux. Even if we do add it for
guest-memfd later, it shouldn't be a challenge to maintain both. This is
just not a contract we need to sign forever yet.

Hope the above clarifies a bit why I chose the simplest solution for now.
I don't like this API either, as I mentioned in the cover letter; it's
really a trade-off, and at least for now the best I can come up with.

That said, if anyone has a better solution, please shoot. I'm always open
to better alternatives.

Thanks,

--
Peter Xu