Re: [PATCH HOTFIX 6.17] mm/mremap: avoid expensive folio lookup on mremap folio pte batch
From: Ryan Roberts
Date: Fri Aug 08 2025 - 03:57:09 EST
On 08/08/2025 08:45, David Hildenbrand wrote:
>>
>> Not sure if some sleep has changed your mind on what "hint" means? I'm pretty
>> sure David named this function, but for me the name makes sense. The arch is
>> saying "I know that the pte batch is at least N ptes. It's up to you if you use
>> that information. I'll still work correctly if you ignore it".
>
> The last one is the important bit I think.
>
>>
>> For me, your interpretation of 'the most number of PTEs that _might_ coalesce'
>> would be a guess, not a hint.
>
> I'm not a native speaker, so I'll let both of you figure that out. To me it
> makes sense as well ... but well, I was involved when creating that function. :)
>
>>
>>>
>>> I understand the con PTE bit is a 'hint' but as I recall you saying at
>>> LSF/MM 'modern CPUs take the hint'. Which presumably is where this comes
>>> from, but that's kinda deceptive.
>>>
>>> Anyway the reason I was emphatic here is on the basis that I believe I had
>>> this explained to met his way, which obviously I or whoever it was (don't
>>> recall) must have misunderstood. Or perhaps I hallucinated it... :)
>>
>> FWIW, this is the documentation for the function:
>>
>> /**
>> * pte_batch_hint - Number of pages that can be added to batch without scanning.
>> * @ptep: Page table pointer for the entry.
>> * @pte: Page table entry.
>> *
>> * Some architectures know that a set of contiguous ptes all map the same
>> * contiguous memory with the same permissions. In this case, it can provide a
>> * hint to aid pte batching without the core code needing to scan every pte.
>> *
>> * An architecture implementation may ignore the PTE accessed state. Further,
>> * the dirty state must apply atomically to all the PTEs described by the hint.
>> *
>> * May be overridden by the architecture, else pte_batch_hint is always 1.
>> */
>
> It's actually ... surprisingly good after reading it again after at least a year.
>
>>
>>>
>>> I see that folio_pte_batch() can get _more_, is this on the basis of there
>>> being adjacent, physically contiguous contPTE entries that can also be
>>> batched up?
>
> [...]
>
>>>>>
>>>>>>
>>>>>>
>>>>>> Not sure if that was discussed at some point before we went into the
>>>>>> direction of using folios. But there really doesn't seem to be anything
>>>>>> gained for other architectures here (as raised by Jann).
>>>>>
>>>>> Yup... I wonder about the other instances of this... ruh roh.
>>>>
>>>> IIRC prior to Dev's mprotect and mremap optimizations, I believe all sites
>>>> already needed the folio. I haven't actually looked at how mprotect ended up,
>>>> but maybe worth checking to see if it should protect with pte_batch_hint() too.
>>>
>>> mprotect didn't? I mean let's check.
>>
>> I think for mprotect, the folio was only previously needed for the numa case. I
>> have a vague memory that either Dev of I proposed wrapping folio_pte_batch() to
>> only get the folio and call it if the next PTE had an adjacent PFN (or something
>> like that). But it was deemed to complex. I might be misremembering... could
>> have been an internal conversation. I'll chat with Dev about it and revisit.
>>
>
> I am probably to blame here, because I think I rejected early to have arm64-only
> optimization, assuming other arch could benefit here as well with batching. But
> as it seems, batching in mremap() code really only serves the cont-pte managing
> code, and the folio_pte_batch() is really entirely unnecessary.
>
> In case of mprotect(), I think really only (a) NUMA and (b) anon-folio write-
> upgrade required the folio. So it's a bit more tricky than mremap() here
> where ... the folio is entirely irrelevant.
>
> One could detect the "anon write-upgrade possible" case early as well, and only
> lookup the folio in that case, otherwise use the straight pte hint.
>
> So I think there is some room for further improvement.
ACK; Dev, perhaps you can take another look at this and work up a patch to more
agressively avoid vm_normal_folio() for mprotect?
Thanks,
Ryan