Re: [PATCH v4 2/4] mm: Add batched versions of ptep_modify_prot_start/commit
From: Lorenzo Stoakes
Date: Tue Jul 01 2025 - 04:07:06 EST
On Tue, Jul 01, 2025 at 08:33:32AM +0100, Ryan Roberts wrote:
> On 01/07/2025 05:44, Dev Jain wrote:
> >
> > On 30/06/25 6:27 pm, Lorenzo Stoakes wrote:
> >> On Sat, Jun 28, 2025 at 05:04:33PM +0530, Dev Jain wrote:
> >>> Batch ptep_modify_prot_start/commit in preparation for optimizing mprotect.
> >>> Architectures can override these helpers; if they do not, they are implemented
> >>> as a simple loop over the corresponding single-pte helpers.
> >>>
> >>> Signed-off-by: Dev Jain <dev.jain@xxxxxxx>
> >> Looks generally sensible! Some comments below.
> >>
> >>> ---
> >>> include/linux/pgtable.h | 83 ++++++++++++++++++++++++++++++++++++++++-
> >>> mm/mprotect.c | 4 +-
> >>> 2 files changed, 84 insertions(+), 3 deletions(-)
> >>>
> >>> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
> >>> index cf1515c163e2..662f39e7475a 100644
> >>> --- a/include/linux/pgtable.h
> >>> +++ b/include/linux/pgtable.h
> >>> @@ -1331,7 +1331,8 @@ static inline pte_t ptep_modify_prot_start(struct vm_area_struct *vma,
> >>>
> >>> /*
> >>> * Commit an update to a pte, leaving any hardware-controlled bits in
> >>> - * the PTE unmodified.
> >>> + * the PTE unmodified. The pte may have been "upgraded" w.r.t. a/d bits
> >>> + * compared to old_pte, i.e. it may have a/d bits set which were clear in old_pte.
> >>> */
> >>> static inline void ptep_modify_prot_commit(struct vm_area_struct *vma,
> >>> unsigned long addr,
> >>> @@ -1340,6 +1341,86 @@ static inline void ptep_modify_prot_commit(struct vm_area_struct *vma,
> >>> __ptep_modify_prot_commit(vma, addr, ptep, pte);
> >>> }
> >>> #endif /* __HAVE_ARCH_PTEP_MODIFY_PROT_TRANSACTION */
> >>> +
> >>> +/**
> >>> + * modify_prot_start_ptes - Start a pte protection read-modify-write transaction
> >>> + * over a batch of ptes, which protects against asynchronous hardware
> >>> + * modifications to the ptes. The intention is not to prevent the hardware from
> >>> + * making pte updates, but to prevent any updates it may make from being lost.
> >>> + * Please see the comment above ptep_modify_prot_start() for full description.
> >>> + *
> >>> + * @vma: The virtual memory area the pages are mapped into.
> >>> + * @addr: Address the first page is mapped at.
> >>> + * @ptep: Page table pointer for the first entry.
> >>> + * @nr: Number of entries.
> >>> + *
> >>> + * May be overridden by the architecture; otherwise, implemented as a simple
> >>> + * loop over ptep_modify_prot_start(), collecting the a/d bits from each pte
> >>> + * in the batch.
> >>> + *
> >>> + * Note that PTE bits in the PTE batch besides the PFN can differ.
> >>> + *
> >>> + * Context: The caller holds the page table lock. The PTEs map consecutive
> >>> + * pages that belong to the same folio. The PTEs are all in the same PMD.
> >>> + * Since the batch is determined from folio_pte_batch, the PTEs must differ
> >>> + * only in a/d bits (and the soft dirty bit; see fpb_t flags in
> >>> + * mprotect_folio_pte_batch()).
> >>> + */
> >>> +#ifndef modify_prot_start_ptes
> >>> +static inline pte_t modify_prot_start_ptes(struct vm_area_struct *vma,
> >>> +		unsigned long addr, pte_t *ptep, unsigned int nr)
> >>> +{
> >>> +	pte_t pte, tmp_pte;
> >>> +
> >>> +	pte = ptep_modify_prot_start(vma, addr, ptep);
> >>> +	while (--nr) {
> >>> +		ptep++;
> >>> +		addr += PAGE_SIZE;
> >>> +		tmp_pte = ptep_modify_prot_start(vma, addr, ptep);
> >>> +		if (pte_dirty(tmp_pte))
> >>> +			pte = pte_mkdirty(pte);
> >>> +		if (pte_young(tmp_pte))
> >>> +			pte = pte_mkyoung(pte);
> >>> +	}
> >>> +	return pte;
> >>> +}
> >>> +#endif
> >>> +
> >>> +/**
> >>> + * modify_prot_commit_ptes - Commit an update to a batch of ptes, leaving any
> >>> + * hardware-controlled bits in the PTE unmodified.
> >>> + *
> >>> + * @vma: The virtual memory area the pages are mapped into.
> >>> + * @addr: Address the first page is mapped at.
> >>> + * @ptep: Page table pointer for the first entry.
> >>> + * @old_pte: Old page table entry (for the first entry) which is now cleared.
> >>> + * @pte: New page table entry to be set.
> >>> + * @nr: Number of entries.
> >>> + *
> >>> + * May be overridden by the architecture; otherwise, implemented as a simple
> >>> + * loop over ptep_modify_prot_commit().
> >>> + *
> >>> + * Context: The caller holds the page table lock. The PTEs are all in the same
> >>> + * PMD. On exit, the set ptes in the batch map the same folio. The pte may have
> >>> + * been "upgraded" w.r.t a/d bits compared to the old_pte, as in, it may have
> >>> + * a/d bits on which were off in old_pte.
> >>> + */
> >>> +#ifndef modify_prot_commit_ptes
> >>> +static inline void modify_prot_commit_ptes(struct vm_area_struct *vma, unsigned long addr,
> >>> +		pte_t *ptep, pte_t old_pte, pte_t pte, unsigned int nr)
> >>> +{
> >>> +	int i;
> >>> +
> >>> +	for (i = 0; i < nr; ++i) {
> >>> +		ptep_modify_prot_commit(vma, addr, ptep, old_pte, pte);
> >>> +		ptep++;
> >> Weird place to put this increment, maybe just stick it in the for loop.
> >>
> >>> +		addr += PAGE_SIZE;
> >> Same comment here.
> >
> > Sure.
> >
> >>
> >>> +		old_pte = pte_next_pfn(old_pte);
> >> Could be:
> >>
> >> old_pte = pte;
> >>
> >> No?
> >
> > We will need to update old_pte also since that
> > is used by powerpc in radix__ptep_modify_prot_commit().
>
> I think perhaps Lorenzo has the model in his head where old_pte is the previous
> pte in the batch. That's not the case. old_pte is the value of the pte in the
> current position of the batch before any changes were made. pte is the new value
> for the pte. So we need to explicitly advance the PFN in both old_pte and pte
> each iteration round the loop.
Yeah, you're right, apologies, I'd misinterpreted.
I really, really, really hate how all this is implemented. This is obviously an
mprotect() and legacy thing but it's almost designed for confusion. Not the
fault of this series, and todo++ on improving mprotect as a whole (been on my
list for a while...)
So we're ultimately updating ptep (this thing that we update, of course, is
buried in the middle of the function invocation) in:
ptep_modify_prot_commit(vma, addr, ptep, old_pte, pte);
We are setting *ptep++ = pte, essentially (roughly speaking), right?
And the arch needs to know about any bits that have changed, I guess, hence
providing old_pte as well, right?
OK so yeah, I get it now, we're not actually advancing through ptes here, we're
just advancing the PFN and applying the same 'template'.
How about something like:
static inline void modify_prot_commit_ptes(struct vm_area_struct *vma, unsigned long addr,
		pte_t *ptep, pte_t old_pte, pte_t pte, unsigned int nr)
{
	int i;

	for (i = 0; i < nr; i++, ptep++, addr += PAGE_SIZE) {
		ptep_modify_prot_commit(vma, addr, ptep, old_pte, pte);

		/* Advance PFN only, set same flags. */
		old_pte = pte_next_pfn(old_pte);
		pte = pte_next_pfn(pte);
	}
}
Neatens it up a bit and makes it clear that we're effectively propagating the
flags here.
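To make the end-to-end flow concrete for anyone skimming the archive, here is a
rough, hypothetical caller-side sketch of how the batched pair slots together,
mirroring the change_pte_range() hunk further down; the helper name
change_prot_batch() is made up here, and the uffd-wp, write-upgrade and
TLB-flush handling of the real caller is left out:

/*
 * Hypothetical, stripped-down caller: apply newprot to nr_ptes consecutive
 * entries that folio_pte_batch() has already grouped into one batch.
 */
static void change_prot_batch(struct vm_area_struct *vma, unsigned long addr,
			      pte_t *pte, pgprot_t newprot, unsigned int nr_ptes)
{
	pte_t oldpte, ptent;

	/* Clear the ptes and accumulate a/d bits over the whole batch. */
	oldpte = modify_prot_start_ptes(vma, addr, pte, nr_ptes);

	/* Build the new "template" pte for the first entry in the batch. */
	ptent = pte_modify(oldpte, newprot);

	/* Re-install, advancing the PFN of both old and new pte per entry. */
	modify_prot_commit_ptes(vma, addr, pte, oldpte, ptent, nr_ptes);
}

The point being: modify_prot_start_ptes() hands back a single pte carrying the
OR of the a/d bits across the batch, and modify_prot_commit_ptes() fans the
modified template back out across all nr_ptes entries.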
>
> >
> >>
> >>> +		pte = pte_next_pfn(pte);
> >>> +	}
> >>> +}
> >>> +#endif
> >>> +
> >>> #endif /* CONFIG_MMU */
> >>>
> >>> /*
> >>> diff --git a/mm/mprotect.c b/mm/mprotect.c
> >>> index af10a7fbe6b8..627b0d67cc4a 100644
> >>> --- a/mm/mprotect.c
> >>> +++ b/mm/mprotect.c
> >>> @@ -206,7 +206,7 @@ static long change_pte_range(struct mmu_gather *tlb,
> >>> continue;
> >>> }
> >>>
> >>> - oldpte = ptep_modify_prot_start(vma, addr, pte);
> >>> + oldpte = modify_prot_start_ptes(vma, addr, pte, nr_ptes);
> >>> ptent = pte_modify(oldpte, newprot);
> >>>
> >>> if (uffd_wp)
> >>> @@ -232,7 +232,7 @@ static long change_pte_range(struct mmu_gather *tlb,
> >>> can_change_pte_writable(vma, addr, ptent))
> >>> ptent = pte_mkwrite(ptent, vma);
> >>>
> >>> - ptep_modify_prot_commit(vma, addr, pte, oldpte, ptent);
> >>> + modify_prot_commit_ptes(vma, addr, pte, oldpte, ptent, nr_ptes);
> >>> if (pte_needs_flush(oldpte, ptent))
> >>> tlb_flush_pte_range(tlb, addr, PAGE_SIZE);
> >>> pages++;
> >>> --
> >>> 2.30.2
> >>>
>