Re: [PATCH v4 2/4] mm: Add batched versions of ptep_modify_prot_start/commit
From: Lorenzo Stoakes
Date: Tue Jul 01 2025 - 04:07:06 EST
On Tue, Jul 01, 2025 at 08:33:32AM +0100, Ryan Roberts wrote:
> On 01/07/2025 05:44, Dev Jain wrote:
> >
> > On 30/06/25 6:27 pm, Lorenzo Stoakes wrote:
> >> On Sat, Jun 28, 2025 at 05:04:33PM +0530, Dev Jain wrote:
> >>> Batch ptep_modify_prot_start/commit in preparation for optimizing mprotect.
> >>> Architectures can override these helpers; if they do not, they are implemented
> >>> as a simple loop over the corresponding single-pte helpers.
> >>>
> >>> Signed-off-by: Dev Jain <dev.jain@xxxxxxx>
> >> Looks generally sensible! Some comments below.
> >>
> >>> ---
> >>> include/linux/pgtable.h | 83 ++++++++++++++++++++++++++++++++++++++++-
> >>> mm/mprotect.c | 4 +-
> >>> 2 files changed, 84 insertions(+), 3 deletions(-)
> >>>
> >>> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
> >>> index cf1515c163e2..662f39e7475a 100644
> >>> --- a/include/linux/pgtable.h
> >>> +++ b/include/linux/pgtable.h
> >>> @@ -1331,7 +1331,8 @@ static inline pte_t ptep_modify_prot_start(struct vm_area_struct *vma,
> >>>
> >>> /*
> >>> * Commit an update to a pte, leaving any hardware-controlled bits in
> >>> - * the PTE unmodified.
> >>> + * the PTE unmodified. The pte may have been "upgraded" w.r.t. a/d bits
> >>> + * compared to old_pte, i.e. it may have a/d bits set which were clear in old_pte.
> >>> */
> >>> static inline void ptep_modify_prot_commit(struct vm_area_struct *vma,
> >>> unsigned long addr,
> >>> @@ -1340,6 +1341,86 @@ static inline void ptep_modify_prot_commit(struct vm_area_struct *vma,
> >>> __ptep_modify_prot_commit(vma, addr, ptep, pte);
> >>> }
> >>> #endif /* __HAVE_ARCH_PTEP_MODIFY_PROT_TRANSACTION */
> >>> +
> >>> +/**
> >>> + * modify_prot_start_ptes - Start a pte protection read-modify-write transaction
> >>> + * over a batch of ptes, which protects against asynchronous hardware
> >>> + * modifications to the ptes. The intention is not to prevent the hardware from
> >>> + * making pte updates, but to prevent any updates it may make from being lost.
> >>> + * Please see the comment above ptep_modify_prot_start() for full description.
> >>> + *
> >>> + * @vma: The virtual memory area the pages are mapped into.
> >>> + * @addr: Address the first page is mapped at.
> >>> + * @ptep: Page table pointer for the first entry.
> >>> + * @nr: Number of entries.
> >>> + *
> >>> + * May be overridden by the architecture; otherwise, implemented as a simple
> >>> + * loop over ptep_modify_prot_start(), collecting the a/d bits from each pte
> >>> + * in the batch.
> >>> + *
> >>> + * Note that PTE bits in the PTE batch besides the PFN can differ.
> >>> + *
> >>> + * Context: The caller holds the page table lock. The PTEs map consecutive
> >>> + * pages that belong to the same folio. The PTEs are all in the same PMD.
> >>> + * Since the batch is determined from folio_pte_batch, the PTEs must differ
> >>> + * only in a/d bits (and the soft dirty bit; see fpb_t flags in
> >>> + * mprotect_folio_pte_batch()).
> >>> + */
> >>> +#ifndef modify_prot_start_ptes
> >>> +static inline pte_t modify_prot_start_ptes(struct vm_area_struct *vma,
> >>> +		unsigned long addr, pte_t *ptep, unsigned int nr)
> >>> +{
> >>> +	pte_t pte, tmp_pte;
> >>> +
> >>> +	pte = ptep_modify_prot_start(vma, addr, ptep);
> >>> +	while (--nr) {
> >>> +		ptep++;
> >>> +		addr += PAGE_SIZE;
> >>> +		tmp_pte = ptep_modify_prot_start(vma, addr, ptep);
> >>> +		if (pte_dirty(tmp_pte))
> >>> +			pte = pte_mkdirty(pte);
> >>> +		if (pte_young(tmp_pte))
> >>> +			pte = pte_mkyoung(pte);
> >>> +	}
> >>> +	return pte;
> >>> +}
> >>> +#endif
> >>> +
> >>> +/**
> >>> + * modify_prot_commit_ptes - Commit an update to a batch of ptes, leaving any
> >>> + * hardware-controlled bits in the PTE unmodified.
> >>> + *
> >>> + * @vma: The virtual memory area the pages are mapped into.
> >>> + * @addr: Address the first page is mapped at.
> >>> + * @ptep: Page table pointer for the first entry.
> >>> + * @old_pte: Old page table entry (for the first entry) which is now cleared.
> >>> + * @pte: New page table entry to be set.
> >>> + * @nr: Number of entries.
> >>> + *
> >>> + * May be overridden by the architecture; otherwise, implemented as a simple
> >>> + * loop over ptep_modify_prot_commit().
> >>> + *
> >>> + * Context: The caller holds the page table lock. The PTEs are all in the same
> >>> + * PMD. On exit, the set ptes in the batch map the same folio. The pte may have
> >>> + * been "upgraded" w.r.t a/d bits compared to the old_pte, as in, it may have
> >>> + * a/d bits on which were off in old_pte.
> >>> + */
> >>> +#ifndef modify_prot_commit_ptes
> >>> +static inline void modify_prot_commit_ptes(struct vm_area_struct *vma, unsigned long addr,
> >>> +		pte_t *ptep, pte_t old_pte, pte_t pte, unsigned int nr)
> >>> +{
> >>> +	int i;
> >>> +
> >>> +	for (i = 0; i < nr; ++i) {
> >>> +		ptep_modify_prot_commit(vma, addr, ptep, old_pte, pte);
> >>> +		ptep++;
> >> Weird place to put this increment, maybe just stick it in the for loop.
> >>
> >>> +		addr += PAGE_SIZE;
> >> Same comment here.
> >
> > Sure.
> >
> >>
> >>> +		old_pte = pte_next_pfn(old_pte);
> >> Could be:
> >>
> >> old_pte = pte;
> >>
> >> No?
> >
> > We will need to update old_pte also since that
> > is used by powerpc in radix__ptep_modify_prot_commit().
>
> I think perhaps Lorenzo has the model in his head where old_pte is the previous
> pte in the batch. That's not the case. old_pte is the value of the pte in the
> current position of the batch before any changes were made. pte is the new value
> for the pte. So we need to explicitly advance the PFN in both old_pte and pte
> each iteration round the loop.
Yeah, you're right, apologies, I'd misinterpreted.
I really, really, really hate how all this is implemented. This is obviously an
mprotect() and legacy thing but it's almost designed for confusion. Not the
fault of this series, and todo++ on improving mprotect as a whole (been on my
list for a while...)
So we're ultimately updating ptep (this thing that we update, of course, is
buried in the middle of the function invocation) in:
ptep_modify_prot_commit(vma, addr, ptep, old_pte, pte);
We are setting *ptep++ = pte, essentially (roughly speaking), right?
And the arch needs to know about any bits that have changed, I guess, hence
providing old_pte as well, right?
OK so yeah, I get it now, we're not actually advancing through ptes here, we're
just advancing the PFN and applying the same 'template'.
How about something like:
static inline void modify_prot_commit_ptes(struct vm_area_struct *vma, unsigned long addr,
		pte_t *ptep, pte_t old_pte, pte_t pte, unsigned int nr)
{
	int i;

	for (i = 0; i < nr; i++, ptep++, addr += PAGE_SIZE) {
		ptep_modify_prot_commit(vma, addr, ptep, old_pte, pte);

		/* Advance PFN only, set same flags. */
		old_pte = pte_next_pfn(old_pte);
		pte = pte_next_pfn(pte);
	}
}
Neatens it up a bit and makes it clear that we're effectively propagating the
flags here.
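To make the end-to-end flow concrete for anyone skimming the archive, here is a
rough, hypothetical caller-side sketch of how the batched pair slots together,
mirroring the change_pte_range() hunk further down; the helper name
change_prot_batch() is made up here, and the uffd-wp, write-upgrade and
TLB-flush handling of the real caller is left out:

/*
 * Hypothetical, stripped-down caller: apply newprot to nr_ptes consecutive
 * entries that folio_pte_batch() has already grouped into one batch.
 */
static void change_prot_batch(struct vm_area_struct *vma, unsigned long addr,
			      pte_t *pte, pgprot_t newprot, unsigned int nr_ptes)
{
	pte_t oldpte, ptent;

	/* Clear the ptes and accumulate a/d bits over the whole batch. */
	oldpte = modify_prot_start_ptes(vma, addr, pte, nr_ptes);

	/* Build the new "template" pte for the first entry in the batch. */
	ptent = pte_modify(oldpte, newprot);

	/* Re-install, advancing the PFN of both old and new pte per entry. */
	modify_prot_commit_ptes(vma, addr, pte, oldpte, ptent, nr_ptes);
}

The point being: modify_prot_start_ptes() hands back a single pte carrying the
OR of the a/d bits across the batch, and modify_prot_commit_ptes() fans the
modified template back out across all nr_ptes entries.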
>
> >
> >>
> >>> +		pte = pte_next_pfn(pte);
> >>> +	}
> >>> +}
> >>> +#endif
> >>> +
> >>> #endif /* CONFIG_MMU */
> >>>
> >>> /*
> >>> diff --git a/mm/mprotect.c b/mm/mprotect.c
> >>> index af10a7fbe6b8..627b0d67cc4a 100644
> >>> --- a/mm/mprotect.c
> >>> +++ b/mm/mprotect.c
> >>> @@ -206,7 +206,7 @@ static long change_pte_range(struct mmu_gather *tlb,
> >>> continue;
> >>> }
> >>>
> >>> - oldpte = ptep_modify_prot_start(vma, addr, pte);
> >>> + oldpte = modify_prot_start_ptes(vma, addr, pte, nr_ptes);
> >>> ptent = pte_modify(oldpte, newprot);
> >>>
> >>> if (uffd_wp)
> >>> @@ -232,7 +232,7 @@ static long change_pte_range(struct mmu_gather *tlb,
> >>> can_change_pte_writable(vma, addr, ptent))
> >>> ptent = pte_mkwrite(ptent, vma);
> >>>
> >>> - ptep_modify_prot_commit(vma, addr, pte, oldpte, ptent);
> >>> + modify_prot_commit_ptes(vma, addr, pte, oldpte, ptent, nr_ptes);
> >>> if (pte_needs_flush(oldpte, ptent))
> >>> tlb_flush_pte_range(tlb, addr, PAGE_SIZE);
> >>> pages++;
> >>> --
> >>> 2.30.2
> >>>
>