Re: [PATCH v2 2/2] mm: Optimize mremap() by PTE batching

From: David Hildenbrand
Date: Thu May 08 2025 - 14:01:40 EST

Next message: Jessica Zhang: "Re: [PATCH v9 08/14] drm/msm/dpu: handle pipes as array"
Previous message: Alexei Starovoitov: "Re: [syzbot] [bpf?] possible deadlock in trie_delete_elem"
In reply to: Lorenzo Stoakes: "Re: [PATCH v2 2/2] mm: Optimize mremap() by PTE batching"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

On 08.05.25 12:04, Lorenzo Stoakes wrote:

Before getting into the review, just to say thanks for refactoring as per
my (and of course other's) comments, much appreciated and big improvement!
:)

We're getting there...

On Wed, May 07, 2025 at 11:32:56AM +0530, Dev Jain wrote:

To use PTE batching, we want to determine whether the folio mapped by
the PTE is large, thus requiring the use of vm_normal_folio(). We want
to avoid the cost of vm_normal_folio() if the code path doesn't already
require the folio. For arm64, pte_batch_hint() does the job. To generalize
this hint, add a helper which will determine whether two consecutive PTEs
point to consecutive PFNs, in which case there is a high probability that
the underlying folio is large.
Next, use folio_pte_batch() to optimize move_ptes(). On arm64, if the ptes
are painted with the contig bit, then ptep_get() will iterate through all 16
entries to collect a/d bits. Hence this optimization will result in a 16x
reduction in the number of ptep_get() calls. Next, ptep_get_and_clear()
will eventually call contpte_try_unfold() on every contig block, thus
flushing the TLB for the complete large folio range. Instead, use
get_and_clear_full_ptes() so as to elide TLBIs on each contig block, and only
do them on the starting and ending contig block.

Signed-off-by: Dev Jain <dev.jain@xxxxxxx>
---
include/linux/pgtable.h | 29 +++++++++++++++++++++++++++++
mm/mremap.c | 37 ++++++++++++++++++++++++++++++-------
2 files changed, 59 insertions(+), 7 deletions(-)

diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index b50447ef1c92..38dab1f562ed 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -369,6 +369,35 @@ static inline pgd_t pgdp_get(pgd_t *pgdp)
}
#endif

+/**
+ * maybe_contiguous_pte_pfns - Hint whether the page mapped by the pte belongs
+ * to a large folio.
+ * @ptep: Pointer to the page table entry.
+ * @pte: The page table entry.
+ *
+ * This helper is invoked when the caller wants to batch over a set of ptes
+ * mapping a large folio, but the concerned code path does not already have
+ * the folio. We want to avoid the cost of vm_normal_folio() only to find that
+ * the underlying folio was small; i.e keep the small folio case as fast as
+ * possible.
+ *
+ * The caller must ensure that ptep + 1 exists.
+ */
+static inline bool maybe_contiguous_pte_pfns(pte_t *ptep, pte_t pte)
+{
+ pte_t *next_ptep, next_pte;
+
+ if (pte_batch_hint(ptep, pte) != 1)
+ return true;
+
+ next_ptep = ptep + 1;
+ next_pte = ptep_get(next_ptep);
+ if (!pte_present(next_pte))
+ return false;
+
+ return unlikely(pte_pfn(next_pte) - pte_pfn(pte) == 1);

Let's not do unlikely()'s unless we have data for them... it shouldn't mean
'what the programmer believes' :)

+}

Yeah I'm with Andrew and Anshuman, I mean this is kind of a nasty interface
(I mean _perhaps_ unavoidably) and we've done the relevant check in
mremap_folio_pte_batch(), so let's just move it there with comments, as this

+
#ifndef __HAVE_ARCH_PTEP_TEST_AND_CLEAR_YOUNG
static inline int ptep_test_and_clear_young(struct vm_area_struct *vma,
unsigned long address,
diff --git a/mm/mremap.c b/mm/mremap.c
index 0163e02e5aa8..9c88a276bec4 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -170,6 +170,23 @@ static pte_t move_soft_dirty_pte(pte_t pte)
return pte;
}

+/* mremap a batch of PTEs mapping the same large folio */
+static int mremap_folio_pte_batch(struct vm_area_struct *vma, unsigned long addr,
+ pte_t *ptep, pte_t pte, int max_nr)
+{
+ const fpb_t flags = FPB_IGNORE_DIRTY | FPB_IGNORE_SOFT_DIRTY;
+ struct folio *folio;
+ int nr = 1;
+
+ if ((max_nr != 1) && maybe_contiguous_pte_pfns(ptep, pte)) {
+ folio = vm_normal_folio(vma, addr, pte);
+ if (folio && folio_test_large(folio))
+ nr = folio_pte_batch(folio, addr, ptep, pte, max_nr,
+ flags, NULL, NULL, NULL);
+ }

This needs some refactoring, avoid nesting at all costs :)

We'll want to move the maybe_contiguous_pte_pfns() function over here, so
that'll change things, but in general let's use a guard clause.

So an if block like:

if (foo) {
... bunch of logic ...
}

Is better replaced with a guard clause so you have:

if (!foo)
return ...;

... bunch of logic ...

Here we could really expand things out to make things SUPER clear like:

static int mremap_folio_pte_batch(struct vm_area_struct *vma, unsigned long addr,
pte_t *ptep, pte_t pte, int max_nr)
{
const fpb_t flags;
struct folio *folio;

if (max_nr == 1)
return 1;
if (!maybe_contiguous_pte_pfns(ptep, pte)) // obviously replace with open code...
return 1;

folio = vm_normal_folio(vma, addr, pte);
if (!folio)
return 1;
if (!folio_test_large(folio))
return 1;

flags = FPB_IGNORE_DIRTY | FPB_IGNORE_SOFT_DIRTY;
return folio_pte_batch(folio, addr, ptep, pte, max_nr,
flags, NULL, NULL, NULL);
}

I mean you could argue assign nr would be neater here, but you get the point!

David mentioned a point about this code over in v1 discussion (see
[0]). Trying to bring converastion here to avoid it being split across
old/new series. There he said:

David H:

(2) Do we really need "must be part of the same folio", or could be just batch over present
ptes that map consecutive PFNs? In that case, a helper that avoids folio_pte_batch() completely
might be better.

Hm, if we didn't do the batch test, can we batch a split large folio here ok?
I'm guessing we can in which case this check is actually limiting...

Are we _explicitly_ only considering the cont pte case and ignoring the
split THP case?

[0]: https://lore.kernel.org/all/887fb371-409e-4dad-b4ff-38b85bfddf95@xxxxxxxxxx/

And in what circumstances will the hint be set, with a present subsequent
PTE but !folio_test_large()?

I guess the hint might not be taken? But then isn't the valid check just
folio_test_large() and we don't need this batched check at all?

Is it only to avoid the split THP case?

We definitely need some clarity here, and a comment in the code explaining
what's going on as this is subtle stuff.

The whole maybe_contiguous_pte_pfns() is really far from nice. Let's add something like that *only if absolutely required* for performance reasons and on top of this patch.

But let's clarify if we have to limit ourselves to a single folio at all.

Staring at it, I think the tricky bit is:

pte = get_and_clear_full_ptes();

That means that

a) if any PTE is dirty, the resulting PTE will be dirty.
b) if any PTE is young, the resulting PTE will be young.

So we could be marking PTEs as dirty/young from unrelated folios.

That's problematic.

--
Cheers,

David / dhildenb

Next message: Jessica Zhang: "Re: [PATCH v9 08/14] drm/msm/dpu: handle pipes as array"
Previous message: Alexei Starovoitov: "Re: [syzbot] [bpf?] possible deadlock in trie_delete_elem"
In reply to: Lorenzo Stoakes: "Re: [PATCH v2 2/2] mm: Optimize mremap() by PTE batching"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]