Re: [PATCH 00/11] mm/mremap: introduce more mergeable mremap via MREMAP_RELOCATE_ANON

From: David Hildenbrand
Date: Mon Jun 16 2025 - 16:42:02 EST


On 16.06.25 22:24, David Hildenbrand wrote:
Hi Lorenzo,

as discussed offline, there is a lot going on and this is rather ... a
lot of code+complexity for something that is more of a corner case. :)

Corner-case as in: only select user space will benefit from this, which
is really a shame.

After your presentation at LSF/MM, I thought about this further, and I
was wondering whether:

(a) We cannot make this semi-automatic, avoiding flags.

(b) We cannot simplify further by limiting it to the common+easy cases
first.

I think you already did (b) to some degree as part of this non-RFC,
which is great.


So before digging into the details, let's discuss the high level problem
briefly.

I think there are three parts to it:

(1) Detecting whether it is safe to adjust the folio->index (small
folios)

(2) Performance implications of doing so

(3) Detecting whether it is safe to adjust the folio->index (large PTE-
mapped folios)


Regarding (1), if we simply track whether a folio was ever used for
COW-sharing, it would be very easy: and not only for present folios, but
for any anon folios that are referenced by swap/migration entries.
Skimming over patch #1, I think you apply a similar logic, which is good.

Regarding (2), it would apply when we mremap() anon VMAs and they happen
to reside next to other anon VMAs. Which workloads are we concerned
about harming by implementing this optimization? I recall that the most
common use case for mremap() is actually for file mappings, but I might
be wrong. In any case, we could provide a different way to enable this
optimization than opting in on each and every mremap() invocation in a
process.

Regarding (3), if we were to split large folios that cross VMA
boundaries during mremap(), it would be simpler.

How is it handled in this series if a large folio crosses VMA
boundaries? (a) try splitting, or (b) fail (not transparent to the user :( )?


This also creates a difference in behaviour, often surprising to users,
between mappings which are faulted and those which are not - as for the
latter we adjust vma->vm_pgoff upon mremap() to aid mergeability.

This is problematic firstly because this proliferates kernel allocations
that are pure memory pressure - unreclaimable and unmovable -
i.e. vm_area_struct, anon_vma, anon_vma_chain objects that need not exist.
Secondly, mremap() exhibits an implicit uAPI in that it does not permit
remaps which span multiple VMAs (though it does permit remaps that
constitute a part of a single VMA).

If I mremap() to create a hole and mremap() it back, I would expect the
hole to be closed again automatically, without special flags. Well, we
both know this is not the case :)

This means that a user must concern themselves with whether merges
succeed
or not should they wish to use mremap() in such a way which causes multiple
mremap() calls to be performed upon mappings.

Right.


This series provides users with an option to accept the overhead of
actually updating the VMA and underlying folios via the
MREMAP_RELOCATE_ANON flag.

Okay. I wish we could avoid this flag ...


If MREMAP_RELOCATE_ANON is specified, but an ordinary merge would result in
the mremap() succeeding, then no attempt is made at relocation of folios as
this is not required.

Makes sense. This is the existing behavior then.


Even if no merge is possible upon moving of the region, vma->vm_pgoff and
folio->index fields are appropriately updated in order that subsequent
mremap() or mprotect() calls will succeed in merging.

By looking at the surrounding VMAs, or simply by always trying to keep
folio->index corresponding to the address in the VMA? (just as if
mremap() had never happened, I assume?)


This flag falls back to the ordinary means of mremap() should the operation
not be feasible. It also transparently undoes the operation, carefully
holding rmap locks such that no racing rmap operation encounters incorrect
or missing VMAs.

I absolutely dislike this undo operation, really. :(

I hope we can find a way to just detect early whether this optimization
would work.

Which are the exact error cases you can run into for un-doing?

I assume:

(a) cow-shared anon folio (can detect early)

(b) large folios crossing VMAs (TBD)

(c) KSM folios? Probably we could move them, I *think* we would have to
update the ksm_rmap_item. Alternatively, we could indicate if a VMA had
any KSM folios and give up early in the first version.

Looking at patch #1, I can see that we treat KSM folios as "success".

I would have thought we would have to update the corresponding "ksm_rmap_item" ... somehow, to keep the rmap working.

I know that Wei Yang (already on cc) is working on selftests, which I am yet to review, but he doesn't cover mremap() yet.


Looking at rmap_walk_ksm(), I am left a bit confused.

We walk all entries in the stable tree (ksm_rmap_item), looking in the anon_vma interval tree for the entry that corresponds to ksm_rmap_item->address.

addr = rmap_item->address & PAGE_MASK;

if (addr < vma->vm_start || addr >= vma->vm_end)
continue;

So I would assume, already when we mremap() ... we are *already* breaking KSM rmap walkers? :) Or there is somewhere some magic that I am missing.

A KSM mremap test case for rmap would be nice ;)

--
Cheers,

David / dhildenb