Before getting into the review, just to say thanks for refactoring as per
my (and of course others') comments, much appreciated and a big improvement!
:)
We're getting there...
On Wed, May 07, 2025 at 11:32:56AM +0530, Dev Jain wrote:
To use PTE batching, we want to determine whether the folio mapped by
the PTE is large, thus requiring the use of vm_normal_folio(). We want
to avoid the cost of vm_normal_folio() if the code path doesn't already
require the folio. For arm64, pte_batch_hint() does the job. To generalize
this hint, add a helper which will determine whether two consecutive PTEs
point to consecutive PFNs, in which case there is a high probability that
the underlying folio is large.
Next, use folio_pte_batch() to optimize move_ptes(). On arm64, if the ptes
are painted with the contig bit, then ptep_get() will iterate through all 16
entries to collect a/d bits. Hence this optimization will result in a 16x
reduction in the number of ptep_get() calls. Next, ptep_get_and_clear()
will eventually call contpte_try_unfold() on every contig block, thus
flushing the TLB for the complete large folio range. Instead, use
get_and_clear_full_ptes() so as to elide TLBIs on each contig block, and only
do them on the starting and ending contig block.
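For anyone reading along without the rest of the series to hand, my
understanding is the move path ends up doing roughly the below - illustrative
only, nr/old_ptep/new_ptep are my names, not necessarily the patch's:

	/*
	 * Clear the whole batch at once: on arm64 this only unfolds (and
	 * hence TLBIs) the first and last contig block of the folio, rather
	 * than every contig block as per-pte ptep_get_and_clear() would.
	 * The returned pte has the accessed/dirty bits of the batch
	 * collected into it.
	 */
	pte = get_and_clear_full_ptes(mm, old_addr, old_ptep, nr, /* full = */ 0);
	...
	/* Write out the whole batch at the destination in one go. */
	set_ptes(mm, new_addr, new_ptep, pte, nr);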
Signed-off-by: Dev Jain <dev.jain@xxxxxxx>
---
include/linux/pgtable.h | 29 +++++++++++++++++++++++++++++
mm/mremap.c | 37 ++++++++++++++++++++++++++++++-------
2 files changed, 59 insertions(+), 7 deletions(-)
diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index b50447ef1c92..38dab1f562ed 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -369,6 +369,35 @@ static inline pgd_t pgdp_get(pgd_t *pgdp)
}
#endif
+/**
+ * maybe_contiguous_pte_pfns - Hint whether the page mapped by the pte belongs
+ * to a large folio.
+ * @ptep: Pointer to the page table entry.
+ * @pte: The page table entry.
+ *
+ * This helper is invoked when the caller wants to batch over a set of ptes
+ * mapping a large folio, but the concerned code path does not already have
+ * the folio. We want to avoid the cost of vm_normal_folio() only to find that
+ * the underlying folio was small; i.e. keep the small folio case as fast as
+ * possible.
+ *
+ * The caller must ensure that ptep + 1 exists.
+ */
+static inline bool maybe_contiguous_pte_pfns(pte_t *ptep, pte_t pte)
+{
+ pte_t *next_ptep, next_pte;
+
+ if (pte_batch_hint(ptep, pte) != 1)
+ return true;
+
+ next_ptep = ptep + 1;
+ next_pte = ptep_get(next_ptep);
+ if (!pte_present(next_pte))
+ return false;
+
+ return unlikely(pte_pfn(next_pte) - pte_pfn(pte) == 1);
Let's not add unlikely()s unless we have data for them... it shouldn't mean
'what the programmer believes' :)
+}
Yeah, I'm with Andrew and Anshuman - this is kind of a nasty interface
(_perhaps_ unavoidably), and we've done the relevant check in
mremap_folio_pte_batch(), so let's just move it there with comments, as this
is its only caller.
+
#ifndef __HAVE_ARCH_PTEP_TEST_AND_CLEAR_YOUNG
static inline int ptep_test_and_clear_young(struct vm_area_struct *vma,
unsigned long address,
diff --git a/mm/mremap.c b/mm/mremap.c
index 0163e02e5aa8..9c88a276bec4 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -170,6 +170,23 @@ static pte_t move_soft_dirty_pte(pte_t pte)
return pte;
}
+/* mremap a batch of PTEs mapping the same large folio */
+static int mremap_folio_pte_batch(struct vm_area_struct *vma, unsigned long addr,
+ pte_t *ptep, pte_t pte, int max_nr)
+{
+ const fpb_t flags = FPB_IGNORE_DIRTY | FPB_IGNORE_SOFT_DIRTY;
+ struct folio *folio;
+ int nr = 1;
+
+ if ((max_nr != 1) && maybe_contiguous_pte_pfns(ptep, pte)) {
+ folio = vm_normal_folio(vma, addr, pte);
+ if (folio && folio_test_large(folio))
+ nr = folio_pte_batch(folio, addr, ptep, pte, max_nr,
+ flags, NULL, NULL, NULL);
+ }
This needs some refactoring, avoid nesting at all costs :)
We'll want to move the maybe_contiguous_pte_pfns() function over here, so
that'll change things, but in general let's use a guard clause.
So an if block like:
if (foo) {
... bunch of logic ...
}
Is better replaced with a guard clause so you have:
if (!foo)
return ...;
... bunch of logic ...
Here we could really expand things out to make things SUPER clear like:
static int mremap_folio_pte_batch(struct vm_area_struct *vma, unsigned long addr,
pte_t *ptep, pte_t pte, int max_nr)
{
	const fpb_t flags = FPB_IGNORE_DIRTY | FPB_IGNORE_SOFT_DIRTY;
struct folio *folio;
if (max_nr == 1)
return 1;
if (!maybe_contiguous_pte_pfns(ptep, pte)) // obviously replace with open code...
return 1;
folio = vm_normal_folio(vma, addr, pte);
if (!folio)
return 1;
if (!folio_test_large(folio))
return 1;
return folio_pte_batch(folio, addr, ptep, pte, max_nr,
flags, NULL, NULL, NULL);
}
I mean, you could argue assigning to nr would be neater here, but you get the point!
David mentioned a point about this code over in the v1 discussion (see
[0]). Trying to bring the conversation here to avoid it being split across
the old/new series. There he said:
David H:
(2) Do we really need "must be part of the same folio", or could be just batch over present
ptes that map consecutive PFNs? In that case, a helper that avoids folio_pte_batch() completely
might be better.
Hm, if we didn't do the batch test, could we batch across a split large folio
here OK? I'm guessing we could, in which case this check is actually
limiting...
Are we _explicitly_ only considering the cont pte case and ignoring the
split THP case?
[0]: https://lore.kernel.org/all/887fb371-409e-4dad-b4ff-38b85bfddf95@xxxxxxxxxx/
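If we did go the way David suggests, I'd imagine something like the below
(purely illustrative and untested, name hypothetical), which would also
naturally handle the split-folio case:

static int mremap_pte_batch(pte_t *ptep, pte_t pte, int max_nr)
{
	int nr = 1;

	while (nr < max_nr) {
		pte_t next = ptep_get(ptep + nr);

		/*
		 * Require present ptes mapping consecutive pfns with
		 * identical attributes (modulo accessed/dirty) - set_ptes()
		 * stamps the same attributes across the whole batch, so we
		 * can't merge ptes that differ in e.g. writability.
		 */
		if (!pte_present(next) ||
		    !pte_same(pte_mkclean(pte_mkold(next)),
			      pte_mkclean(pte_mkold(pte_advance_pfn(pte, nr)))))
			break;
		nr++;
	}

	return nr;
}

Though whether losing folio_pte_batch()'s handling of soft-dirty etc. is
acceptable here, I'm not sure.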
And in what circumstances will the hint be set, with a present subsequent
PTE but !folio_test_large()?
I guess the hint might simply not be taken? But then isn't the valid check
just folio_test_large(), meaning we don't need this batched check at all?
Is it only to avoid the split THP case?
We definitely need some clarity here, and a comment in the code explaining
what's going on as this is subtle stuff.