Re: [PATCH HOTFIX 6.17] mm/mremap: avoid expensive folio lookup on mremap folio pte batch
From: Lorenzo Stoakes
Date: Fri Aug 08 2025 - 05:41:26 EST
On Fri, Aug 08, 2025 at 08:19:47AM +0100, Ryan Roberts wrote:
> > Yup David explained.
> >
> > I suggest you rename this from 'hint', because that's not what hint means
> > :) unless I'm really misunderstanding what this word means (it's 10pm and I
> > started work at 6am so it's currently rather possible).
>
> That's a lot of hours; I certainly appreciate you for putting the effort in and
> figuring out the root cause so quickly.
Thanks. Of course not all these hours were spent on this, it was really my
very sincere attempt to help out Dev who might not have an x86-64 bare
metal box as readily available (literally, physically next to me) as I did.
I'm glad I could help identify this! Jann, Dev and David gave very
insightful analysis also, and all happened to be entirely correct!
>
> Not sure if some sleep has changed your mind on what "hint" means? I'm pretty
> sure David named this function, but for me the name makes sense. The arch is
> saying "I know that the pte batch is at least N ptes. It's up to you if you use
> that information. I'll still work correctly if you ignore it".
>
> For me, your interpretation of 'the most number of PTEs that _might_ coalesce'
> would be a guess, not a hint.
I feel we need to drop this subject for sanity's sake... :)
In my view the crux of this is that a reasonable definition of hint (and
first that appears on google) is that _partial_ information is provided,
which I interpreted as a bound.
Of course it's 'partial' in that there may be adjacent physically
contiguous PTE entries, and the whole thing is made murkier by the fact
that the cont PTE bit itself is a hint, which is again presumably why we
named it so.
However, if you, David, Dev and possibly _everybody else_ interprets this
differently, then I'm happy to concede perhaps it's just me getting hung up
on semantics here.
So I won't insist on this changing, though personally I'd still prefer it.
At any rate what it does is now abundantly clear :)
The comment is good by the way, I kicked myself for only reading it after
the fact.
>From my perspective, I think a misinterpreted conversation (from whichever
side) was the underlying issue, and I've again had it reinforced the need
to always work from first principles on review.
[snip]
>
> >
> > I see that folio_pte_batch() can get _more_, is this on the basis of there
> > being adjacent, physically contiguous contPTE entries that can also be
> > batched up?
>
> From folio_pte_batch()'s perspective, they just have to be physically contiguous
> and part of the same folio; they are not *required* to be contpte. They could
> even have different read/write permissions in the middle of the batch (if the
> flags permit); that's one reason why such a batch wouldn't be mapped contpte (a
> contpte mapping is logically a single mapping so the permissions must all be the
> same).
Yeah I suspected this might be the case.
Sooo. Now we're rejecting if the first PTE isn't contPTE, is this a problem?
I remember Dev's first attempt at this checked hint, if == 1 I believe it would
then manually look ahead to see if there was a possible batch.
>
> >
> >>
> >> This function is looking to see if ptep is inside a conpte mapping, and if it
> >> is, it's returning the number of ptes to the end of the contpte mapping (which
> >> is of 64K size and alignment on 4K kernels). A contpte mapping will only exist
> >> if the physical memory is appropriately aligned/sized and all belongs to a
> >> single folio.
> >>
> >>>
> >>> (note that a bit grossly we'll call it _again_ in folio_pte_batch_flags()).
> >>>
> >>> I suppose we could not even bother with checking if same folio and _just_ check
> >>> if PTEs have consecutive PFNs, which is not very likely if different folio
> >>> but... could that break something?
> >>
> >> Yes something could break; the batch must *all* belong to the same folio.
> >> Functions like set_ptes() require that in their documentation, and arm64 depends
> >> upon it in order not to screw up the access/dirty bits.
> >
> > Turning this around - is a cont pte range guaranteed to belong to only one
> > folio?
>
> Yes.
Great
>
> >
> > If so then we can just limit the range to one batched block for the sake of
> > mremap that perhaps doesn't necessarily hugely benefit from further
> > batching anyway?
>
> Yes.
Also great.
>
> >
> > Let's take the time to check performance on arm64 hardware.
> >
> > Are we able to check to see how things behave if we have small folios only
> > in the tested range on arm64
>
> I thought Dev provided numbers for that, but I'll chat with him and ensure we
> re-test (and broaden the testing if needed) with the new patch.
Thanks.
> >>>> Not sure if that was discussed at some point before we went into the
> >>>> direction of using folios. But there really doesn't seem to be anything
> >>>> gained for other architectures here (as raised by Jann).
> >>>
> >>> Yup... I wonder about the other instances of this... ruh roh.
> >>
> >> IIRC prior to Dev's mprotect and mremap optimizations, I believe all sites
> >> already needed the folio. I haven't actually looked at how mprotect ended up,
> >> but maybe worth checking to see if it should protect with pte_batch_hint() too.
> >
> > mprotect didn't? I mean let's check.
>
> I think for mprotect, the folio was only previously needed for the numa case. I
> have a vague memory that either Dev of I proposed wrapping folio_pte_batch() to
> only get the folio and call it if the next PTE had an adjacent PFN (or something
> like that). But it was deemed to complex. I might be misremembering... could
> have been an internal conversation. I'll chat with Dev about it and revisit.
OK thanks.
Cheers, Lorenzo