Re: [PATCH HOTFIX 6.17] mm/mremap: avoid expensive folio lookup on mremap folio pte batch

From: Dev Jain
Date: Fri Aug 08 2025 - 04:45:02 EST



On 08/08/25 1:26 pm, Ryan Roberts wrote:
On 08/08/2025 08:45, David Hildenbrand wrote:
Not sure if some sleep has changed your mind on what "hint" means? I'm pretty
sure David named this function, but for me the name makes sense. The arch is
saying "I know that the pte batch is at least N ptes. It's up to you if you use
that information. I'll still work correctly if you ignore it".
The last one is the important bit I think.

For me, your interpretation of 'the most number of PTEs that _might_ coalesce'
would be a guess, not a hint.
I'm not a native speaker, so I'll let both of you figure that out. To me it
makes sense as well ... but well, I was involved when creating that function. :)

I understand the con PTE bit is a 'hint' but as I recall you saying at
LSF/MM 'modern CPUs take the hint'. Which presumably is where this comes
from, but that's kinda deceptive.

Anyway the reason I was emphatic here is on the basis that I believe I had
this explained to met his way, which obviously I or whoever it was (don't
recall) must have misunderstood. Or perhaps I hallucinated it... :)
FWIW, this is the documentation for the function:

/**
  * pte_batch_hint - Number of pages that can be added to batch without scanning.
  * @ptep: Page table pointer for the entry.
  * @pte: Page table entry.
  *
  * Some architectures know that a set of contiguous ptes all map the same
  * contiguous memory with the same permissions. In this case, it can provide a
  * hint to aid pte batching without the core code needing to scan every pte.
  *
  * An architecture implementation may ignore the PTE accessed state. Further,
  * the dirty state must apply atomically to all the PTEs described by the hint.
  *
  * May be overridden by the architecture, else pte_batch_hint is always 1.
  */
It's actually ... surprisingly good after reading it again after at least a year.

I see that folio_pte_batch() can get _more_, is this on the basis of there
being adjacent, physically contiguous contPTE entries that can also be
batched up?
[...]


Not sure if that was discussed at some point before we went into the
direction of using folios. But there really doesn't seem to be anything
gained for other architectures here (as raised by Jann).
Yup... I wonder about the other instances of this... ruh roh.
IIRC prior to Dev's mprotect and mremap optimizations, I believe all sites
already needed the folio. I haven't actually looked at how mprotect ended up,
but maybe worth checking to see if it should protect with pte_batch_hint() too.
mprotect didn't? I mean let's check.
I think for mprotect, the folio was only previously needed for the numa case. I
have a vague memory that either Dev of I proposed wrapping folio_pte_batch() to
only get the folio and call it if the next PTE had an adjacent PFN (or something
like that). But it was deemed to complex. I might be misremembering... could
have been an internal conversation. I'll chat with Dev about it and revisit.

I am probably to blame here, because I think I rejected early to have arm64-only
optimization, assuming other arch could benefit here as well with batching. But
as it seems, batching in mremap() code really only serves the cont-pte managing
code, and the folio_pte_batch() is really entirely unnecessary.

In case of mprotect(), I think really only (a) NUMA and (b) anon-folio write-
upgrade required the folio. So it's a bit more tricky than mremap() here
where ... the folio is entirely irrelevant.

One could detect the "anon write-upgrade possible" case early as well, and only
lookup the folio in that case, otherwise use the straight pte hint.

So I think there is some room for further improvement.
ACK; Dev, perhaps you can take another look at this and work up a patch to more
agressively avoid vm_normal_folio() for mprotect?

Yup I'll investigate this in a few weeks time perhaps - try to use pte_batch_hint(),
and when we have to unconditionally retrieve the folio, then use that instead.

I'll also look into whether even for arm64, there is any use in retrieving
the folio at all - the only benefit we get is to batch across contig blocks, but I don't
think there are any atomic operations (refcount/mapcount manipulation) which will be saved.
By batching across a single contig block we save on ptep_get calls and TLBIs, which was
our objective.


Thanks,
Ryan