On Sat, Jul 19, 2025 at 07:16:48PM +0530, Dev Jain wrote:
On 19/07/25 12:19 am, Lorenzo Stoakes wrote:
On Fri, Jul 18, 2025 at 02:32:43PM +0530, Dev Jain wrote:
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@xxxxxxxxxx>
Thanks!

You're welcome :)

---
mm/mprotect.c | 125 +++++++++++++++++++++++++++++++++++++++++++++-----
1 file changed, 113 insertions(+), 12 deletions(-)
diff --git a/mm/mprotect.c b/mm/mprotect.c
index a1c7d8a4648d..2ddd37b2f462 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -106,7 +106,7 @@ bool can_change_pte_writable(struct vm_area_struct *vma, unsigned long addr,
}
static int mprotect_folio_pte_batch(struct folio *folio, pte_t *ptep,
- pte_t pte, int max_nr_ptes)
+ pte_t pte, int max_nr_ptes, fpb_t flags)
{
/* No underlying folio, so cannot batch */
if (!folio)
@@ -115,7 +115,7 @@ static int mprotect_folio_pte_batch(struct folio *folio, pte_t *ptep,
if (!folio_test_large(folio))
return 1;
- return folio_pte_batch(folio, ptep, pte, max_nr_ptes);
+ return folio_pte_batch_flags(folio, NULL, ptep, &pte, max_nr_ptes, flags);
}
static bool prot_numa_skip(struct vm_area_struct *vma, unsigned long addr,
@@ -177,6 +177,102 @@ static bool prot_numa_skip(struct vm_area_struct *vma, unsigned long addr,
return ret;
}
+/* Set nr_ptes number of ptes, starting from idx */
+static void prot_commit_flush_ptes(struct vm_area_struct *vma, unsigned long addr,
+ pte_t *ptep, pte_t oldpte, pte_t ptent, int nr_ptes,
+ int idx, bool set_write, struct mmu_gather *tlb)
+{
+ /*
+ * Advance the position in the batch by idx; note that if idx > 0,
+ * then the nr_ptes passed here is <= batch size - idx.
+ */
+ addr += idx * PAGE_SIZE;
+ ptep += idx;
+ oldpte = pte_advance_pfn(oldpte, idx);
+ ptent = pte_advance_pfn(ptent, idx);
+
+ if (set_write)
+ ptent = pte_mkwrite(ptent, vma);
+
+ modify_prot_commit_ptes(vma, addr, ptep, oldpte, ptent, nr_ptes);
+ if (pte_needs_flush(oldpte, ptent))
+ tlb_flush_pte_range(tlb, addr, nr_ptes * PAGE_SIZE);
+}
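
An aside for anyone following the idx/nr_ptes bookkeeping above: it amounts to "skip idx entries into the batch, then commit the next nr_ptes". A tiny compile-and-run userspace sketch of that arithmetic, purely illustrative (commit_sub_batch and the int array are made-up stand-ins, not the kernel code):

/* Userspace model of the idx/nr_ptes bookkeeping; not kernel code. */
#include <assert.h>
#include <stdio.h>

#define BATCH_SIZE 8

/* Stand-in for "commit nr entries starting at position idx of the batch". */
static void commit_sub_batch(int *batch, int idx, int nr)
{
	int *p = batch + idx;		/* advance into the batch, like ptep += idx */
	int i;

	assert(idx + nr <= BATCH_SIZE);	/* nr is always <= batch size - idx */
	for (i = 0; i < nr; i++)
		p[i] = 1;		/* "commit" entries idx .. idx + nr - 1 */
}

int main(void)
{
	int batch[BATCH_SIZE] = { 0 };
	int i;

	commit_sub_batch(batch, 2, 3);	/* touches entries 2, 3 and 4 only */
	for (i = 0; i < BATCH_SIZE; i++)
		printf("%d", batch[i]);
	printf("\n");			/* prints 00111000 */
	return 0;
}
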
+
+/*
+ * Get max length of consecutive ptes pointing to PageAnonExclusive() pages or
+ * !PageAnonExclusive() pages, starting from start_idx. Caller must enforce
+ * that the ptes point to consecutive pages of the same anon large folio.
+ */
+static int page_anon_exclusive_sub_batch(int start_idx, int max_len,
+ struct page *first_page, bool expected_anon_exclusive)
+{
+ int idx;
+ for (idx = start_idx + 1; idx < start_idx + max_len; ++idx) {

Nit but:

	int end = start_idx + max_len;

	for (idx = start_idx + 1; idx < end; idx++) {

Would be a little neater here.

I politely disagree :) start_idx + max_len is *obviously* the
end index, no need to add one more line of code asserting that.

Haha, well disagreement is permitted you know ;) as long as it's polite of
course...

That's fine, this isn't a big deal.

+ if (expected_anon_exclusive != PageAnonExclusive(first_page + idx))

Nitty again but the below might be a little clearer?

	struct page *page = &first_page[idx];

	if (expected_anon_exclusive != PageAnonExclusive(page))

I don't think so. first_page[idx] may confuse us into thinking that
we have an array of pages. Also, the way you define it assigns a
stack address to struct page *page; this is not a problem in theory
and the code will still be correct, but I would prefer struct page *page
to contain the actual address of the linear map struct page, which is
vmemmap + PFN. The way I write it is, I initialize first_page from folio_page(),
which derives the address from folio->page, and folio was derived from
vm_normal_folio() (which was derived from the PFN in the PTE); therefore
first_page will contain the actual vmemmap address of the corresponding struct page,
hence it is guaranteed that first_page + x will give me the x'th page in
the folio.

OK, I don't think this is an issue, but I'm not going to press this, as it's a
trivial thing, it's fine as-is :)
+ break;
+ }
+ return idx - start_idx;
+}
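
For what it's worth, here is a small userspace model of the sub-batch scan above, which also shows that &first_page[idx] and first_page + idx denote the same address either way. It is illustrative only: sub_batch_len and the bool array are made-up stand-ins for the PageAnonExclusive() checks on consecutive struct pages, not the kernel code.

/* Userspace model of the sub-batch scan; illustrative only, not kernel code. */
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/*
 * Length of the run of entries starting at start_idx that all share the
 * value 'expected', scanning at most max_len entries. Mirrors the shape of
 * page_anon_exclusive_sub_batch().
 */
static int sub_batch_len(int start_idx, int max_len, const bool *first,
			 bool expected)
{
	int idx;

	for (idx = start_idx + 1; idx < start_idx + max_len; ++idx) {
		if (expected != first[idx])
			break;
	}
	return idx - start_idx;
}

int main(void)
{
	/* Exclusivity pattern of an 8-entry batch: runs of 3, 2 and 3. */
	bool anon_excl[8] = { true, true, true, false, false, true, true, true };

	/* &anon_excl[3] and anon_excl + 3 are the same address in C. */
	assert(&anon_excl[3] == anon_excl + 3);

	printf("%d\n", sub_batch_len(0, 8, anon_excl, anon_excl[0])); /* 3 */
	printf("%d\n", sub_batch_len(3, 5, anon_excl, anon_excl[3])); /* 2 */
	printf("%d\n", sub_batch_len(5, 3, anon_excl, anon_excl[5])); /* 3 */
	return 0;
}
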
+
+/*
+ * This function is a result of trying our very best to retain the
+ * "avoid the write-fault handler" optimization. In can_change_pte_writable(),
+ * if the vma is a private vma, and we cannot determine whether to change
+ * the pte to writable just from the vma and the pte, we then need to look
+ * at the actual page pointed to by the pte. Unfortunately, if we have a
+ * batch of ptes pointing to consecutive pages of the same anon large folio,
+ * the anon-exclusivity (or the negation) of the first page does not guarantee
+ * the anon-exclusivity (or the negation) of the other pages corresponding to
+ * the pte batch; hence in this case it is incorrect to decide to change or
+ * not change the ptes to writable just by using information from the first
+ * pte of the batch. Therefore, we must individually check all pages and
+ * retrieve sub-batches.
+ */
Nice comment thanks.

+static void commit_anon_folio_batch(struct vm_area_struct *vma,
+ struct folio *folio, unsigned long addr, pte_t *ptep,
+ pte_t oldpte, pte_t ptent, int nr_ptes, struct mmu_gather *tlb)
+{
+ struct page *first_page = folio_page(folio, 0);
+ bool expected_anon_exclusive;
+ int sub_batch_idx = 0;
+ int len;
+
+ while (nr_ptes) {
+ expected_anon_exclusive = PageAnonExclusive(first_page + sub_batch_idx);

I'd prefer this to be:

	int i;

	...

	for (i = 0; i < nr_ptes; i += len, sub_batch_idx += len) {

We won't be able to do nr_ptes -= len with this. And personally a while loop
is clearer to me here.

Well, you don't need to :) maybe rename i to pte_idx + pass nr_ptes - pte_idx.

Buuuut I'm not going to press this, it's not a big deal, and I see your point!
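
Just to make it concrete that the two loop shapes visit the same sub-batches, here is a throwaway userspace sketch of both (run_len, excl and the loop driver are made-up stand-ins; not the kernel code):

/* Userspace sketch of the two loop shapes discussed above; not kernel code. */
#include <stdbool.h>
#include <stdio.h>

/* Toy stand-in for the per-sub-batch run-length scan. */
static int run_len(const bool *first, int start_idx, int max_len)
{
	int idx;

	for (idx = start_idx + 1; idx < start_idx + max_len; ++idx)
		if (first[idx] != first[start_idx])
			break;
	return idx - start_idx;
}

int main(void)
{
	bool excl[8] = { true, true, false, false, false, true, false, false };
	int nr_ptes, sub_batch_idx, pte_idx, len;

	/* Shape 1: while loop, consuming nr_ptes as we go. */
	nr_ptes = 8;
	sub_batch_idx = 0;
	while (nr_ptes) {
		len = run_len(excl, sub_batch_idx, nr_ptes);
		printf("sub-batch at %d, len %d\n", sub_batch_idx, len);
		sub_batch_idx += len;
		nr_ptes -= len;
	}

	/* Shape 2: for loop over pte_idx, passing nr_ptes - pte_idx instead. */
	nr_ptes = 8;
	sub_batch_idx = 0;
	for (pte_idx = 0; pte_idx < nr_ptes; pte_idx += len, sub_batch_idx += len) {
		len = run_len(excl, sub_batch_idx, nr_ptes - pte_idx);
		printf("sub-batch at %d, len %d\n", sub_batch_idx, len);
	}
	return 0;
}

Both passes print the same sub-batch start/length pairs, so it really is just a matter of taste.
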
Overall the R-b tag still stands with the above unchanged.
Thanks for doing this series and being open to feedback, I feel we've iterated
to something nice here!
Cheers, Lorenzo