+ /* The above check should imply these. */
+ VM_WARN_ON_ONCE(folio_mapcount(folio) > folio_nr_pages(folio));
+ VM_WARN_ON_ONCE(!PageAnonExclusive(folio_page(folio, 0)));
This can trigger in one nasty case, where we can lose the PAE bit during
swapin (refault from the swapcache while the folio is under writeback, and
the device does not allow for modifying the data while under writeback).
Ugh, I wasn't aware of that. So maybe drop this second one?
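If that second assertion does get dropped, a minimal sketch of what might remain
(illustrative only, not the series' code):

	/* The above check should imply this. */
	VM_WARN_ON_ONCE(folio_mapcount(folio) > folio_nr_pages(folio));
	/*
	 * No PageAnonExclusive() assertion here: the PAE bit can legitimately
	 * be lost on a swapcache refault while the folio is under writeback,
	 * on devices that disallow modifying data under writeback.
	 */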
+
+ /*
+ * A pinned folio implies that it will be used for a duration longer
+ * than that over which the mmap_lock is held, meaning that another part
+ * of the kernel may be making use of this folio.
+ *
+ * Since we are about to manipulate index & mapping fields, we cannot
+ * safely proceed because whatever has pinned this folio may then
+ * incorrectly assume these do not change.
+ */
+ if (folio_maybe_dma_pinned(folio))
+ goto out;
As discussed, this can race with GUP-fast. So *maybe* we can just allow for
moving these.
I'm guessing you mean as discussed below? :P Or in the cover letter I've not
read yet? :P
Yeah, to be honest you shouldn't be fiddling with index, mapping anyway except
via rmap logic.
I will audit accesses of these fields just to be safe.
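To make the race concrete, an illustrative sketch (not the series' code; vma,
addr and the rmap call stand in for whatever the real helper actually does):

	if (folio_maybe_dma_pinned(folio))
		goto out;
	/*
	 * GUP-fast on another CPU holds neither the mmap_lock nor the PTL, so
	 * it can take a pin right here: the check above is only a snapshot.
	 */
	folio_move_anon_rmap(folio, vma);	/* rewrites folio->mapping */
	WRITE_ONCE(folio->index, linear_page_index(vma, addr));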
+
+ state.ptep = ptep_start;
+ for (; !pte_done(&state); pte_next(&state, nr_pages)) {
+ pte_t pte = ptep_get(state.ptep);
+
+ if (pte_none(pte) || !pte_present(pte)) {
+ nr_pages = 1;
What if we have
(a) A migration entry (possibly we might fail migration and simply remap the
original folio)
(b) A swap entry with a folio in the swapcache that we can refault.
I don't think we can simply skip these ...
Good point... will investigate these cases (one way they might be classified
rather than skipped is sketched after this hunk).
+ continue;
+ }
+
+ nr_pages = relocate_anon_pte(pmc, &state, undo);
+ if (!nr_pages) {
+ ret = false;
+ goto out;
+ }
+ }
+
+ ret = true;
+out:
+ pte_unmap_unlock(ptep_start, state.ptl);
+ return ret;
+}
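For the non-present cases raised above, a rough sketch of how they might be
classified in place of the pte_none/!pte_present block (the bail-outs are
placeholders; whether failing is the right response is exactly what needs
investigating):

	if (!pte_present(pte)) {
		swp_entry_t entry;

		nr_pages = 1;
		if (pte_none(pte))
			continue;

		entry = pte_to_swp_entry(pte);
		if (is_migration_entry(entry)) {
			/*
			 * Migration may fail and remap the original anon
			 * folio, so this cannot simply be skipped.
			 */
			ret = false;
			goto out;
		}
		if (!non_swap_entry(entry)) {
			/*
			 * Plain swap entry: the folio may still sit in the
			 * swapcache and be refaulted after the move.
			 */
			ret = false;
			goto out;
		}
		continue;
	}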
+
+static bool __relocate_anon_folios(struct pagetable_move_control *pmc, bool undo)
+{
+ pud_t *pudp;
+ pmd_t *pmdp;
+ unsigned long extent;
+ struct mm_struct *mm = current->mm;
+
+ if (!pmc->len_in)
+ return true;
+
+ for (; !pmc_done(pmc); pmc_next(pmc, extent)) {
+ pmd_t pmd;
+ pud_t pud;
+
+ extent = get_extent(NORMAL_PUD, pmc);
+
+ pudp = get_old_pud(mm, pmc->old_addr);
+ if (!pudp)
+ continue;
+ pud = pudp_get(pudp);
+
+ if (pud_trans_huge(pud) || pud_devmap(pud))
+ return false;
We don't support PUD-sized THP, why do we have to fail here?
This is just to be in line with other 'magical future where we have PUD THP'
stuff in mremap.c.
A later commit that adds huge folio support actually lets us handle these...
+
+ extent = get_extent(NORMAL_PMD, pmc);
+ pmdp = get_old_pmd(mm, pmc->old_addr);
+ if (!pmdp)
+ continue;
+ pmd = pmdp_get(pmdp);
+
+ if (is_swap_pmd(pmd) || pmd_trans_huge(pmd) ||
+ pmd_devmap(pmd))
+ return false;
Okay, this case could likely be handled later (present anon folio or
migration entry; everything else, we can skip).
Hmm, but how? The PMD cannot be traversed in this case?
'Present' migration entry? Migration entries are non-present right? :) Or is it
different at PMD?
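For what it's worth, PMD migration entries are also encoded as non-present,
swap-style PMDs (which is what is_swap_pmd() catches). A hedged sketch of how
the cases might be told apart (illustrative only; is_pmd_migration_entry()
relies on CONFIG_ARCH_ENABLE_THP_MIGRATION):

	if (is_pmd_migration_entry(pmd)) {
		/*
		 * Non-present, swap-style PMD: migration may fail and remap
		 * the original anon folio, so don't just skip it.
		 */
		return false;
	}
	if (pmd_trans_huge(pmd) || pmd_devmap(pmd)) {
		/* Present huge anon folio: left to a later patch. */
		return false;
	}
	if (is_swap_pmd(pmd)) {
		/* Any other swap-like PMD is unexpected for anon. */
		return false;
	}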
pmc.new = new_vma;
+ if (relocate_anon) {
+ lock_new_anon_vma(new_vma);
+ pmc.relocate_locked = new_vma;
+
+ if (!relocate_anon_folios(&pmc, /* undo= */false)) {
+ unsigned long start = new_vma->vm_start;
+ unsigned long size = new_vma->vm_end - start;
+
+ /* Undo if this fails. */
+ relocate_anon_folios(&pmc, /* undo= */true);
You'd assume this cannot fail, but I think it can: imagine concurrent
GUP-fast ...
Well, if we change the racy code to ignore DMA-pinned folios we should be OK,
right?
I really wish we could find a way to not require the fallback.
Yeah, the fallback is horrible but we really do need it. See the page table move
fallback code for nightmares also :)
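If the fallback stays, a minimal sketch of at least making a failed undo
visible (illustrative only, not the series' code):

	if (!relocate_anon_folios(&pmc, /* undo= */false)) {
		/*
		 * Undo on failure. The undo itself can fail (e.g. a folio
		 * picked up a GUP-fast pin between the two passes), so be
		 * loud rather than silently leaving folios pointing at the
		 * wrong anon_vma.
		 */
		if (!relocate_anon_folios(&pmc, /* undo= */true))
			VM_WARN_ON_ONCE(1);
	}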
We could also alternatively:
- Have some kind of anon_vma fragmentation where some folios in range reference
a different anon_vma that we link to the original VMA (quite possibly very
broken though).
- Keep track of folios somehow and separate them from the page table walk (but
  then we risk races)
- Have some way of telling the kernel that such a situation exists via a new
  object that folio->mapping can point to and that the rmap code recognises,
  essentially an 'anon_vma migration entry' which can fail.
I already considered combining this operation with the page table move
operation, but the locking gets horrible, the undo is categorically much
worse, and I'm not sure it's actually workable.