Re: [PATCH 00/11] mm/mremap: introduce more mergeable mremap via MREMAP_RELOCATE_ANON

From: David Hildenbrand
Date: Mon Jun 16 2025 - 16:24:49 EST


Hi Lorenzo,

as discussed offline, there is a lot going on and this is rather ... a lot of code+complexity for something that is more of a corner case. :)

Corner-case as in: only select user space will benefit from this, which is really a shame.

After your presentation at LSF/MM, I thought about this further, and I was wondering whether:

(a) We cannot make this semi-automatic, avoiding flags.

(b) We cannot simplify further by limiting it to the common+easy cases first.

I think you already did (b) to some degree as part of this non-RFC, which is great.


So before digging into the details, let's discuss the high level problem briefly.

I think there are three parts to it:

(1) Detecting whether it is safe to adjust the folio->index (small
folios)

(2) Performance implications of doing so

(3) Detecting whether it is safe to adjust the folio->index (large PTE-
mapped folios)


Regarding (1), if we simply track whether a folio was ever used for COW-sharing, it would be very easy, and not only for present folios but also for any anon folios that are referenced by swap/migration entries. Skimming over patch #1, I think you apply similar logic, which is good.
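
Rough sketch of what I have in mind, with a completely made-up sticky
folio flag (PG_owner_priv_1 only stands in for a real new bit, nothing
like this exists today):

#include <linux/mm.h>

/*
 * Hypothetical sticky marker, set once an anon folio becomes COW-shared
 * and never cleared again.
 */
#define PG_anon_ever_cow_shared PG_owner_priv_1

/* Would be set at fork time, when duplicating the anon mapping. */
static void folio_set_ever_cow_shared(struct folio *folio)
{
	set_bit(PG_anon_ever_cow_shared, &folio->flags);
}

/*
 * Adjusting folio->index is then safe iff the folio was never
 * COW-shared: no other VMA, swap entry or migration entry can refer
 * to it under a different index.
 */
static bool can_adjust_folio_index(struct folio *folio)
{
	return !test_bit(PG_anon_ever_cow_shared, &folio->flags);
}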

Regarding (2), it would apply when we mremap() anon VMAs and they happen to reside next to other anon VMAs. Which workloads are we concerned about harming by implementing this optimization? I recall that the most common use case for mremap() is actually for file mappings, but I might be wrong. In any case, we could have a different way to enable this optimization than passing a flag to each and every mremap() invocation in a process.

Regarding (3), if we were to split large folios that cross VMA boundaries during mremap(), it would be simpler.

How is it handled in this series if a large folio crosses VMA boundaries? (a) try splitting or (b) fail (not transparent to the user :( ).
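
For reference, a rough way to detect the straddling case, assuming the
folio sits at its natural linear address (hypothetical helper, not from
the series):

#include <linux/mm.h>

/*
 * Hypothetical: does a PTE-mapped large folio extend beyond the VMA?
 * Assumes the folio is mapped at its natural, linear address, which
 * earlier checks would have to establish.
 */
static bool folio_crosses_vma(struct folio *folio, struct vm_area_struct *vma)
{
	pgoff_t first = folio->index;
	pgoff_t last = first + folio_nr_pages(folio) - 1;

	return first < vma->vm_pgoff ||
	       last >= vma->vm_pgoff + vma_pages(vma);
}

Splitting such a folio via split_folio() before the move would then reduce
(3) to the small-folio case (1), at the cost of losing the large folio.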


> This also creates a difference in behaviour, often surprising to users,
> between mappings which are faulted and those which are not - as for the
> latter we adjust vma->vm_pgoff upon mremap() to aid mergeability.
>
> This is problematic firstly because this proliferates kernel allocations
> that are pure memory pressure - unreclaimable and unmovable -
> i.e. vm_area_struct, anon_vma, anon_vma_chain objects that need not exist.
>
> Secondly, mremap() exhibits an implicit uAPI in that it does not permit
> remaps which span multiple VMAs (though it does permit remaps that
> constitute a part of a single VMA).

If I mremap() to create a hole and then mremap() it back, I would expect the hole to automatically be closed again, without special flags. Well, we both know this is not the case :)
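
To spell the scenario out (self-contained sketch; error handling omitted,
4 KiB pages assumed):

#define _GNU_SOURCE
#include <sys/mman.h>

int main(void)
{
	const size_t pg = 4096;
	char *p = mmap(NULL, 3 * pg, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	/* Fault everything in so anon folios actually exist. */
	for (size_t i = 0; i < 3 * pg; i++)
		p[i] = 1;

	/* Punch a hole: move the middle page somewhere else ... */
	char *tmp = mremap(p + pg, pg, pg,
			   MREMAP_MAYMOVE | MREMAP_FIXED, p + 16 * pg);

	/* ... then move it straight back where it came from. */
	mremap(tmp, pg, pg, MREMAP_MAYMOVE | MREMAP_FIXED, p + pg);

	/*
	 * Naively one expects a single 3-page VMA again; in practice
	 * /proc/self/maps tends to show three adjacent anon VMAs, as
	 * the merge fails once anon folios are involved.
	 */
	return 0;
}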

> This means that a user must concern themselves with whether merges succeed
> or not should they wish to use mremap() in such a way which causes multiple
> mremap() calls to be performed upon mappings.

Right.


> This series provides users with an option to accept the overhead of
> actually updating the VMA and underlying folios via the
> MREMAP_RELOCATE_ANON flag.

Okay. I wish we could avoid this flag ...


> If MREMAP_RELOCATE_ANON is specified, but an ordinary merge would result in
> the mremap() succeeding, then no attempt is made at relocation of folios as
> this is not required.

Makes sense. This is the existing behavior then.


> Even if no merge is possible upon moving of the region, vma->vm_pgoff and
> folio->index fields are appropriately updated in order that subsequent
> mremap() or mprotect() calls will succeed in merging.

By looking at the surrounding VMAs, or simply by trying to always keep folio->index corresponding to the address in the VMA (just as if mremap() had never happened, I assume)?
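
For the latter, I would expect something along these lines (sketch,
hypothetical helper name):

#include <linux/pagemap.h>

/*
 * Hypothetical: an anon folio is "in place" when its index matches what
 * linear_page_index() computes for its mapping address, i.e. exactly as
 * if mremap() had never happened.
 */
static bool anon_folio_in_place(struct folio *folio,
				struct vm_area_struct *vma,
				unsigned long addr)
{
	return folio->index == linear_page_index(vma, addr);
}

Relocation would then amount to rewriting folio->index to
linear_page_index(new_vma, new_addr) under the rmap locks.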


> This flag falls back to the ordinary means of mremap() should the operation
> not be feasible. It also transparently undoes the operation, carefully
> holding rmap locks such that no racing rmap operation encounters incorrect
> or missing VMAs.

I absolutely dislike this undo operation, really. :(

I hope we can find a way to just detect early whether this optimization would work.

What are the exact error cases you can run into that require undoing?

I assume:

(a) COW-shared anon folio (can detect early)

(b) large folios crossing VMAs (TBD)

(c) KSM folios? Probably we could move them, I *think* we would have to update the ksm_rmap_item. Alternatively, we could indicate if a VMA had any KSM folios and give up early in the first version.

(d) GUP pins: I think we could allow that ... folio_maybe_dma_pinned() is racy either way (GUP-fast!). To deal with GUP-fast we would have to play different games ...

Anything else?
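
If those are all, a pre-flight check along these lines might be all we
need to avoid the undo path entirely (names and structure purely
illustrative, not from the series):

#include <linux/mm.h>
#include <linux/page-flags.h>

/* Hypothetical pre-flight check; covers cases (a)-(d) above. */
static bool can_relocate_anon_folio(struct folio *folio)
{
	/* (a) COW-shared, or ever-COW-shared with a sticky marker. */
	if (folio_maybe_mapped_shared(folio))
		return false;

	/* (c) KSM folios would also need their ksm_rmap_item updated. */
	if (folio_test_ksm(folio))
		return false;

	/* (d) Racy against GUP-fast either way, best-effort only. */
	if (folio_maybe_dma_pinned(folio))
		return false;

	/* (b) Large folios crossing VMA boundaries checked separately. */
	return true;
}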


> In addition, the MREMAP_MUST_RELOCATE_ANON flag is supplied in case the
> user needs to know whether or not the operation succeeded - this flag is
> identical to MREMAP_RELOCATE_ANON, only if the operation cannot succeed,
> the mremap() fails with -EFAULT.

How would an app deal with these errors? Do you have a user in mind that could do something sensible based on this error?

I'm having a hard time imagining that :)
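
The best I can come up with is blind try-and-fall-back, which arguably
doesn't even need the MUST variant (sketch; flag values are my guess,
not final uAPI):

#define _GNU_SOURCE
#include <errno.h>
#include <sys/mman.h>

/* Proposed flags from this series; numeric values illustrative only. */
#ifndef MREMAP_MUST_RELOCATE_ANON
#define MREMAP_RELOCATE_ANON		0x8
#define MREMAP_MUST_RELOCATE_ANON	0x10
#endif

/* Prefer a mergeable move; fall back to a plain move on -EFAULT. */
static void *move_mergeable(void *old, size_t len, void *new)
{
	void *p = mremap(old, len, len, MREMAP_MAYMOVE | MREMAP_FIXED |
			 MREMAP_MUST_RELOCATE_ANON, new);

	if (p != MAP_FAILED || errno != EFAULT)
		return p;
	/* Relocation impossible; accept a potentially unmergeable VMA. */
	return mremap(old, len, len, MREMAP_MAYMOVE | MREMAP_FIXED, new);
}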


> Note that no-op mremap() operations (such as an unpopulated range, or a
> merge that would trivially succeed already) will succeed under
> MREMAP_MUST_RELOCATE_ANON.
>
> mremap() already walks page tables, so it isn't an order of magnitude
> increase in workload, but constitutes the need to walk to page table leaf
> level and manipulate folios.

Only for anon VMAs, though. Do you have some numbers on how bad it is? I mean, mremap() is already a pretty invasive/expensive operation ... :) ... which is why people started using uffdio_move instead, to avoid the heavy-weight locks.


> The operations all succeed under THP and in general are compatible with
> underlying large folios of any size. In fact, the larger the folio, the
> more efficient the operation is.

Yes.


> Performance testing indicates that time taken using MREMAP_RELOCATE_ANON
> is on the same order of magnitude as ordinary mremap() operations, with
> both exhibiting time proportional to the fraction of the mapping which is
> populated.
>
> Of course, mremap() operations that are entirely aligned are significantly
> faster as they need only move a VMA and a smaller number of higher order
> page tables, but this is unavoidable.

> Previous efforts in this area
> =============================
>
> An approach addressing this issue was previously suggested by Jakub Matena
> in a series posted a few years ago in [0] (and discussed in a master's
> thesis).
>
> However this was a more general effort which attempted to always make
> anonymous mappings more mergeable, and therefore was not quite ready for
> the upstream limelight. In addition, large folio work which has occurred
> since requires us to carefully consider and account for this.
>
> This series is more conservative and targeted (one must specify a flag to
> get this behaviour) and additionally goes to great efforts to handle large
> folios and account for all of the nitty-gritty locking concerns that might
> arise in current kernel code.
>
> Thanks goes out to Jakub for his efforts however, and hopefully this effort
> to take a slightly different approach to the same problem is pleasing to
> him regardless :)
>
> [0]: https://lore.kernel.org/all/20220311174602.288010-1-matenajakub@xxxxxxxxx/

> Use-cases
> =========
>
> * ZGC is a concurrent GC shipped with OpenJDK. A prototype is being worked
>   upon which makes use of extensive mremap() operations to perform
>   defragmentation of objects, taking advantage of the plentiful available
>   virtual address space in a 64-bit system.
>
>   In instances where one VMA is faulted in and another not, merging is not
>   possible, which leads to significant, unreclaimable, kernel metadata
>   overhead and contention on the vm.max_map_count limit.
>
>   This series eliminates the issue entirely.
>
> * It was indicated that Android similarly moves memory around and
>   encounters the very same issues as ZGC.

Isn't Android using uffdio_move?

> * SUSE indicate they have encountered similar issues as pertains to an
>   internal client.

> Past approaches
> ===============
>
> In discussions at LSF/MM/BPF it was suggested that we could make this an
> madvise() operation, however at this point it would be too late to
> correctly perform the merge, requiring an unmap/remap which would be
> egregious.
>
> It was further suggested that we simply defer the operation to the point at
> which an mremap() is attempted on multiple immediately adjacent VMAs (that
> is - to allow VMA fragmentation up until the point where it might cause
> perceptible issues with uAPI).
>
> This is problematic in that in the first instance you accrue fragmentation,
> and only if you were to try to move the fragmented objects again would you
> resolve it.
>
> Additionally you would not be able to handle the mprotect() case, and you'd
> have the same issue as the madvise() approach in that you'd need to
> essentially re-map each VMA.
>
> Additionally it would become non-trivial to correctly merge the VMAs - if
> there were more than 3, we would need to invent a new merging mechanism
> specifically for this, hold locks carefully over each to avoid them
> disappearing from beneath us and introduce a great deal of non-optional
> complexity.
>
> While imperfect, the mremap flag approach seems the least invasive, most
> workable solution (until further rework of the anon_vma mechanism can be
> achieved!)

Well, at that point we already have these new flags ... :(


> include/linux/rmap.h      |   4 +
> include/uapi/linux/mman.h |   8 +-
> mm/internal.h             |   1 +
> mm/mremap.c               | 719 ++++++-
> mm/vma.c                  |  77 +-
> mm/vma.h                  |  36 +-

~ +40% LOC in mm/mremap.c :(

--
Cheers,

David / dhildenb