Re: [PATCH 1/6] prctl: introduce PR_THP_POLICY_DEFAULT_HUGE for the process

From: David Hildenbrand
Date: Fri May 16 2025 - 03:45:44 EST


On 15.05.25 22:35, Lorenzo Stoakes wrote:
On Thu, May 15, 2025 at 09:12:13PM +0200, David Hildenbrand wrote:
On 15.05.25 20:08, Lorenzo Stoakes wrote:
On Thu, May 15, 2025 at 06:11:55PM +0200, David Hildenbrand wrote:
So if you're not overriding VM_NOHUGEPAGE, the whole point of this exercise
is to override global 'never'?


Again, I am not overriding never.

hugepage_global_always and hugepage_global_enabled will evaluate to false
and you will not get a hugepage.

Yeah, again ack, but I kind of hate that we set VM_HUGEPAGE everywhere even
if the policy is never.

I think it should behave just as if someone does manually an madvise(). So
whatever we do here during an madvise, we should try to do the same thing
here.

Ack I agree with this.

It actually simplifies things a LOT to view it this way - we're saying 'by
default apply madvise(...) to new VMAs'.

Hm I wonder if we could have a more generic version of this...

Note though that we're not _quite_ doing this.

So in hugepage_madvise():

int hugepage_madvise(struct vm_area_struct *vma,
unsigned long *vm_flags, int advice)
{
...

switch (advice) {
case MADV_HUGEPAGE:
*vm_flags &= ~VM_NOHUGEPAGE;
*vm_flags |= VM_HUGEPAGE;

...

break;

...
}

...
}

So here we're actually clearing VM_NOHUGEPAGE and overriding it, but in the
proposed code we're not.

Yeah, I think I suggested that, but probably we should just do exactly what
madvise() does.

Yes, agreed.

Usama - do you have any issue with us switching to how madvise() does it?



So we're back into confusing territory again :)

I wonder if we could...

1. Add an MADV_xxx that mimics the desired behaviour here.

2. Add a generic 'madvise() by default' thing at a process level?

Is this crazy?

I think that's what I had in mind, just a bit twisted.

What could work is

1) prctl to set the default

2) madvise() to adjust all existing VMAs


We might have to teach 2) to ignore non-compatible VMAs / holes. Maybe not,
worth an investigation.

Yeah, I think it'd _probably_ be ok except on s390 (which can fail, and so
we'd have to be able to say - skip on error, carry on).

We'll just get an -ENOMEM at the end for the gaps (god how I hate
that). Otherwise I don't think MADV_HUGEPAGE actually is really that
restrictive.

That would simplify :)

But I still so hate using prctl()... this might be one of those cases where
we simply figure out we have no other choice.
> > But when you put it as simply as this maybe it's not so bad. With the
flags2 gone by fixing this stupid 32-bit limit it's less awful.

Perhaps worth seeing what an improved RFC of this series looks like with
all the various bits fixed to give an idea.

Yes.


But you do then wonder if we could make this _generic_ for _any_ madvise(),
and how _that_ would look.

But perhaps that's insane because many VMAs would simply not be suited to
having certain madvise flags set hmm.

Same thinking. I think this is rather special.

In a perfect world not even the madvise(*HUGEPAGE) would exist.

But here we are ... 14 years (wow!) after

commit 0af4e98b6b095c74588af04872f83d333c958c32
Author: Andrea Arcangeli <aarcange@xxxxxxxxxx>
Date: Thu Jan 13 15:46:55 2011 -0800

thp: madvise(MADV_HUGEPAGE)



(I'm surprised you don't complain about madvise(). IMHO, prctl() is even a better interface than catch-all madvise(); a syscall where an advise might not be an advise. I saw some funny rants about MADV_DONTNEED on reddit at some point ... :) mctrl() would have been clearer, at least for me :D )

--
Cheers,

David / dhildenb