Re: [PATCH v4 3/4] mm: Support batched unmap for lazyfree large folios during reclamation

From: David Hildenbrand
Date: Wed Jun 25 2025 - 06:01:05 EST


On 24.06.25 18:25, Lance Yang wrote:
On 2025/6/24 23:34, David Hildenbrand wrote:
On 24.06.25 17:26, Lance Yang wrote:
On 2025/6/24 20:55, David Hildenbrand wrote:
On 14.02.25 10:30, Barry Song wrote:
From: Barry Song <v-songbaohua@xxxxxxxx>
[...]
diff --git a/mm/rmap.c b/mm/rmap.c
index 89e51a7a9509..8786704bd466 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1781,6 +1781,25 @@ void folio_remove_rmap_pud(struct folio *folio, struct page *page,
 #endif
 }
+/* We support batch unmapping of PTEs for lazyfree large folios */
+static inline bool can_batch_unmap_folio_ptes(unsigned long addr,
+                        struct folio *folio, pte_t *ptep)
+{
+        const fpb_t fpb_flags = FPB_IGNORE_DIRTY | FPB_IGNORE_SOFT_DIRTY;
+        int max_nr = folio_nr_pages(folio);

Let's assume we have the first page of a folio mapped at the last page
table entry in our page table.

Good point. I'm curious if it is something we've seen in practice ;)

I challenge you to write a reproducer :P I assume it might be doable
through simple mremap().
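
Roughly something like this (completely untested sketch; 2 MiB PMDs, 4 KiB pages and
64 KiB anon mTHP are assumed, so all constants are illustrative): mremap() just the
first page of a PTE-mapped large folio so it lands on the last PTE slot of a page
table. With a 16-page folio, a batched unmap that blindly trusts folio_nr_pages()
would then read 15 PTEs past the end of that page table.

#define _GNU_SOURCE
#include <string.h>
#include <sys/mman.h>

#define PMD_SZ    (2UL << 20)   /* assumed: 2 MiB PMDs */
#define PAGE_SZ   4096UL        /* assumed: 4 KiB pages */
#define FOLIO_SZ  (64UL << 10)  /* assumed: 64 KiB anon mTHP enabled */

int main(void)
{
        /* Reserve a scratch area and align it so we control PMD boundaries. */
        char *raw = mmap(NULL, 5 * PMD_SZ, PROT_NONE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        char *area = (char *)(((unsigned long)raw + PMD_SZ - 1) & ~(PMD_SZ - 1));
        char *src = area + PMD_SZ;                 /* source for the large folio */
        char *dst = area + 3 * PMD_SZ - PAGE_SZ;   /* last PTE slot of a page table */

        mmap(src, FOLIO_SZ, PROT_READ | PROT_WRITE,
             MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED, -1, 0);
        memset(src, 1, FOLIO_SZ);   /* hopefully faulted in as a single large folio */

        /* Move only the folio's first page so it sits right below a PMD boundary. */
        mremap(src, PAGE_SZ, PAGE_SZ, MREMAP_MAYMOVE | MREMAP_FIXED, dst);

        /* Mark everything lazyfree, then trigger reclaim and watch the batching. */
        madvise(dst, PAGE_SZ, MADV_FREE);
        madvise(src + PAGE_SZ, FOLIO_SZ - PAGE_SZ, MADV_FREE);
        return 0;
}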



What prevents folio_pte_batch() from reading outside the page table?

Assuming such a scenario is possible, to prevent any chance of an
out-of-bounds read, how about this change:

diff --git a/mm/rmap.c b/mm/rmap.c
index fb63d9256f09..9aeae811a38b 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1852,6 +1852,25 @@ static inline bool can_batch_unmap_folio_ptes(unsigned long addr,
         const fpb_t fpb_flags = FPB_IGNORE_DIRTY | FPB_IGNORE_SOFT_DIRTY;
         int max_nr = folio_nr_pages(folio);
         pte_t pte = ptep_get(ptep);
+        unsigned long end_addr;
+
+        /*
+         * To batch unmap, the entire folio's PTEs must be contiguous
+         * and mapped within the same PTE page table, which corresponds to
+         * a single PMD entry. Before calling folio_pte_batch(), which does
+         * not perform boundary checks itself, we must verify that the
+         * address range covered by the folio does not cross a PMD boundary.
+         */
+        end_addr = addr + (max_nr * PAGE_SIZE) - 1;
+
+        /*
+         * A fast way to check for a PMD boundary cross is to align both
+         * the start and end addresses to the PMD boundary and see if they
+         * are different. If they are, the range spans across at least two
+         * different PMD-managed regions.
+         */
+        if ((addr & PMD_MASK) != (end_addr & PMD_MASK))
+                return false;

You should not be messing with max_nr = folio_nr_pages(folio) here at
all. folio_pte_batch() takes care of that.

Also, way too many comments ;)

You may only batch within a single VMA and within a single page table.

So simply align the addr up to the next PMD, and make sure it does not
exceed the vma end.

ALIGN and friends can help avoid excessive comments.
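
Something along these lines (untested, just to show the shape of the clamp):

        /* Never batch past the current page table or past the current VMA. */
        end_addr = min_t(unsigned long, ALIGN(addr + 1, PMD_SIZE), vma->vm_end);
        max_nr = (end_addr - addr) >> PAGE_SHIFT;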

Thanks! How about this updated version based on your suggestion:

diff --git a/mm/rmap.c b/mm/rmap.c
index fb63d9256f09..241d55a92a47 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1847,12 +1847,25 @@ void folio_remove_rmap_pud(struct folio *folio, struct page *page,
 /* We support batch unmapping of PTEs for lazyfree large folios */
 static inline bool can_batch_unmap_folio_ptes(unsigned long addr,
-                        struct folio *folio, pte_t *ptep)
+                        struct folio *folio, pte_t *ptep,
+                        struct vm_area_struct *vma)
 {
         const fpb_t fpb_flags = FPB_IGNORE_DIRTY | FPB_IGNORE_SOFT_DIRTY;
+        unsigned long next_pmd, vma_end, end_addr;
         int max_nr = folio_nr_pages(folio);
         pte_t pte = ptep_get(ptep);
+        /*
+         * Limit the batch scan within a single VMA and within a single
+         * page table.
+         */
+        vma_end = vma->vm_end;
+        next_pmd = ALIGN(addr + 1, PMD_SIZE);
+        end_addr = addr + (unsigned long)max_nr * PAGE_SIZE;
+
+        if (end_addr > min(next_pmd, vma_end))
+                return false;

May I suggest that we clean all that up as we fix it?

Maybe something like this:

diff --git a/mm/rmap.c b/mm/rmap.c
index 3b74bb19c11dd..11fbddc6ad8d6 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1845,23 +1845,38 @@ void folio_remove_rmap_pud(struct folio *folio, struct page *page,
 #endif
 }
-/* We support batch unmapping of PTEs for lazyfree large folios */
-static inline bool can_batch_unmap_folio_ptes(unsigned long addr,
-                        struct folio *folio, pte_t *ptep)
+static inline unsigned int folio_unmap_pte_batch(struct folio *folio,
+                        struct page_vma_mapped_walk *pvmw, enum ttu_flags flags,
+                        pte_t pte)
 {
         const fpb_t fpb_flags = FPB_IGNORE_DIRTY | FPB_IGNORE_SOFT_DIRTY;
-        int max_nr = folio_nr_pages(folio);
-        pte_t pte = ptep_get(ptep);
+        struct vm_area_struct *vma = pvmw->vma;
+        unsigned long end_addr, addr = pvmw->address;
+        unsigned int max_nr;
+
+        if (flags & TTU_HWPOISON)
+                return 1;
+        if (!folio_test_large(folio))
+                return 1;
+
+        /* We may only batch within a single VMA and a single page table. */
+        end_addr = min_t(unsigned long, ALIGN(addr + 1, PMD_SIZE), vma->vm_end);
+        max_nr = (end_addr - addr) >> PAGE_SHIFT;
+        /* We only support lazyfree batching for now ... */
         if (!folio_test_anon(folio) || folio_test_swapbacked(folio))
-                return false;
+                return 1;
         if (pte_unused(pte))
-                return false;
-        if (pte_pfn(pte) != folio_pfn(folio))
-                return false;
+                return 1;
+        /* ... where we must be able to batch the whole folio. */
+        if (pte_pfn(pte) != folio_pfn(folio) || max_nr != folio_nr_pages(folio))
+                return 1;
+        max_nr = folio_pte_batch(folio, addr, pvmw->pte, pte, max_nr, fpb_flags,
+                                 NULL, NULL, NULL);
-        return folio_pte_batch(folio, addr, ptep, pte, max_nr, fpb_flags, NULL,
-                               NULL, NULL) == max_nr;
+        if (max_nr != folio_nr_pages(folio))
+                return 1;
+        return max_nr;
 }
 /*
@@ -2024,9 +2039,7 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
                         if (pte_dirty(pteval))
                                 folio_mark_dirty(folio);
                 } else if (likely(pte_present(pteval))) {
-                        if (folio_test_large(folio) && !(flags & TTU_HWPOISON) &&
-                            can_batch_unmap_folio_ptes(address, folio, pvmw.pte))
-                                nr_pages = folio_nr_pages(folio);
+                        nr_pages = folio_unmap_pte_batch(folio, &pvmw, flags, pteval);
                         end_addr = address + nr_pages * PAGE_SIZE;
                         flush_cache_range(vma, address, end_addr);

Note that I don't quite understand why we have to batch the whole thing or fall back to
individual pages. Why can't we perform other batches that span only some PTEs? What's special
about 1 PTE vs. 2 PTEs vs. all PTEs?


Can someone enlighten me why that is required?

--
Cheers,

David / dhildenb