Re: [PATCH POC] prctl: extend PR_SET_THP_DISABLE to optionally exclude VM_HUGEPAGE

From: David Hildenbrand
Date: Mon Jul 21 2025 - 07:45:33 EST


On 21.07.25 13:28, Lorenzo Stoakes wrote:
Overall, while I HATE this interface (as y'know, everyone knows :P), since
it _already_ exists and fulfils a real need (and we _have_ to keep
supporting that need) I'm open to us solving the issue this way.

So this might be a way for us to achieve what Usama + others need without
having to splice in horridness.

This as a proof-of-concept is obviously not for 6.17 (and late in the day
anyway :P), so we have at least 6.18 cycle to discuss.

On Mon, Jul 21, 2025 at 11:09:42AM +0200, David Hildenbrand wrote:
People want to make use of more THPs, for example, moving from
THP=never to THP=madvise, or from THP=madvise to THP=always.

Nitty, but sort of vague as to what THP= means here, I'd just say 'from
never to madvise, or from madvise to never' - it's pretty clear what you
mean and keeps enough 'flexibility' of interpretation to cover off the
various ways you can do this in the sysfs interfaces.
Same comment for similar below.


While this is great news for every THP desperately waiting to get
allocated out there, apparently there are some workloads that require a
bit of care during that transition: once problems are detected, these
workloads should be started with the old behavior, without making all
other workloads on the system go back to the old behavior as well.

I'm confused about what 'old behavior' is here. Also it's not always
necessarily due to problems, there can be a desire to treat THPs as a
resource to be distributed as desired.

So I'd say something like '... transition: individual processes may need to
opt-out from this behaviour for various reasons, and this should be
permitted without needing to make all other workloads on the system
similarly opt-out'.

No strong opinion.



In essence, the following scenarios are imaginable:

(1) Switch from THP=none to THP=madvise or THP=always, but keep the old
behavior (no THP) for selected workloads.

I'd remove 'old behavior' here as it's confusing, and simply refer to THP
being disabled for selected workloads.

Yes.



(2) Stay at THP=none, but have "madvise" or "always" behavior for
selected workloads.

(3) Switch from THP=madvise to THP=always, but keep the old behavior
(THP only when advised) for selected workloads.

(4) Stay at THP=madvise, but have "always" behavior for selected
workloads.

In essence, (2) can be emulated through (1), by setting THP!=none while
disabling THPs for all processes that don't want THPs. It requires
configuring all workloads, but that is a user-space problem to sort out.

NIT: Delete 'In essence' here.

I wanted "something" there to not make it look like the list keeps going on in a weird way ;)



(4) can be emulated through (3) in a similar way.

Back when (1) was relevant in the past, as people started enabling THPs,
we added PR_SET_THP_DISABLE, so relevant workloads that were not ready
yet (i.e., used by Redis) were able to just disable THPs completely. Redis
still implements the option to use this interface to disable THPs
completely.

With PR_SET_THP_DISABLE, we added a way to force-disable THPs for a
workload -- a process, including fork+exec'ed process hierarchy.
That essentially made us support (1): simply disable THPs for all workloads
that are not ready for THPs yet, while still enabling THPs system-wide.

The quest for handling (3) and (4) started, but current approaches
(completely new prctl, options to set other policies per process,
alternatives to prctl -- mctrl, cgroup handling) don't look particularly
promising. Likely, the future will use bpf or something similar to
implement better policies, in particular to also make better decisions
about THP sizes to use, but this will certainly take a while as that work
just started.

Ack.


Long story short: a simple enable/disable is not really suitable for the
future, so we're not willing to add completely new toggles.

Yes this is the crux of the problem.


While we could emulate (3)+(4) through (1)+(2) by simply disabling THPs
completely for these processes, this scares many THPs in our system
because they could no longer get allocated where they used to be allocated
for: regions flagged as VM_HUGEPAGE. Apparently, that imposes a
problem for relevant workloads, because "not THPs" is certainly worse
than "THPs only when advised".

I don't know what you mean by 'scares' many THPs? :P

They are very afraid of not getting allocated :)



Could we simply relax PR_SET_THP_DISABLE, to "disable THPs unless not
explicitly advised by the app through MAD_HUGEPAGE"? *maybe*, but this

MAD_HUGEPAGE -> MADV_HUGEPAGE

I'm confused by 'unless not explicitly advised' do you mean 'disable THPs
unless explicitly advised by the app through MADV_HUGEPAGE'?

Yes.


would change the documented semantics quite a bit, and the versatility
to use it for debugging purposes, so I am not 100% sure that is what we
want -- although it would certainly be much easier.

So instead, as an easy way forward for (3) and (4), an option to
make PR_SET_THP_DISABLE disable *less* THPs for a process.

In essence, this patch:

(A) Adds PR_THP_DISABLE_EXCEPT_ADVISED, to be used as a flag in arg3
of prctl(PR_SET_THP_DISABLE) when disabling THPs (arg2 != 0).

prctl(PR_SET_THP_DISABLE, 1, PR_THP_DISABLE_EXCEPT_ADVISED)?


For now, arg3 was not allowed to be set (-EINVAL). Now it holds
flags.

This sentence is redundant.


(B) Makes prctl(PR_GET_THP_DISABLE) return 3 if
PR_THP_DISABLE_EXCEPT_ADVISED was set while disabling.

For now, it would return 1 if THPs were disabled completely. Now
it essentially returns the set flags as well.

For now as in 'previously'. I guess right now it's just used as a boolean,
so this seems fine.


(C) Renames MMF_DISABLE_THP to MMF_DISABLE_THP_COMPLETELY, to express
the semantics clearly.

Fortunately, there are only two instances outside of prctl() code.

(D) Adds MMF_DISABLE_THP_EXCEPT_ADVISED to express "no THP except for VMAs
with VM_HUGEPAGE" -- essentially "thp=madvise" behavior

Fortunately, we only have to extend vma_thp_disabled().

(E) Indicates "THP_enabled: 0" in /proc/pid/status only if THPs are not
disabled completely

You mean 'are disabled completely' but this has been covered already :P

Yeah, see my self-reply.



Only indicating that THPs are disabled when they are really disabled
completely, not only partially.
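
To make the consequence of (E) concrete, here is a minimal userspace sketch that reads the field from /proc/self/status. The "THP_enabled:\t%d" line format is taken from the seq_printf() call in fs/proc/array.c quoted further down; the function names parse_thp_enabled() and read_thp_enabled() are hypothetical helpers made up for illustration.

```c
#include <stdio.h>
#include <string.h>

/*
 * Scan a /proc/<pid>/status style buffer for the "THP_enabled:" field.
 * Returns the reported value (0 or 1), or -1 if the field is missing
 * or malformed. Hypothetical helper, not part of the patch.
 */
int parse_thp_enabled(const char *status_text)
{
	const char *key = "THP_enabled:";
	const char *p = status_text;

	while (p && *p) {
		/* p always points at the start of a line here. */
		if (strncmp(p, key, strlen(key)) == 0) {
			int val;

			/* "%d" skips the tab that follows the key. */
			if (sscanf(p + strlen(key), "%d", &val) == 1)
				return val;
			return -1;
		}
		p = strchr(p, '\n');
		if (p)
			p++;
	}
	return -1;
}

/* Read the field for the calling process; -1 on any failure. */
int read_thp_enabled(void)
{
	char buf[8192];
	size_t n;
	FILE *f = fopen("/proc/self/status", "r");

	if (!f)
		return -1;
	n = fread(buf, 1, sizeof(buf) - 1, f);
	fclose(f);
	buf[n] = '\0';
	return parse_thp_enabled(buf);
}
```

Note that with the proposed semantics a value of 1 here would not distinguish "THP fully enabled" from "THP disabled except for advised regions"; only the complete disable would show up as 0.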


So the really interesting part in the above is the small delta this change
represents... which makes it a lot more compelling as a solution.
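
The decision order that (C) and (D) introduce in vma_thp_disabled() can be modelled in plain userspace C, which may help when reasoning about corner cases. This is only a sketch of the logic in the diff below, not kernel code: the MODEL_* names are made up for illustration, the two MMF bit numbers (23 and 24) are copied from the mm_types.h hunk, and the VM_* values are stand-ins that only need to be distinct bits for the model to work.

```c
#include <stdbool.h>

/* Stand-in VMA flag bits; any two distinct bits would do for this model. */
#define MODEL_VM_HUGEPAGE	0x20000000UL
#define MODEL_VM_NOHUGEPAGE	0x40000000UL

/* Bit numbers as proposed in the include/linux/mm_types.h hunk. */
#define MODEL_MMF_DISABLE_THP_EXCEPT_ADVISED	23
#define MODEL_MMF_DISABLE_THP_COMPLETELY	24

/*
 * Userspace model of the proposed vma_thp_disabled() ordering:
 *   1. VM_NOHUGEPAGE on the VMA always disables THP.
 *   2. A complete per-process disable always disables THP.
 *   3. Otherwise, an advised VMA (VM_HUGEPAGE) keeps THP.
 *   4. Otherwise, THP is disabled only if "except advised" is set.
 */
bool model_vma_thp_disabled(unsigned long vm_flags, unsigned long mm_flags)
{
	if (vm_flags & MODEL_VM_NOHUGEPAGE)
		return true;
	if (mm_flags & (1UL << MODEL_MMF_DISABLE_THP_COMPLETELY))
		return true;
	if (vm_flags & MODEL_VM_HUGEPAGE)
		return false;
	return (mm_flags & (1UL << MODEL_MMF_DISABLE_THP_EXCEPT_ADVISED)) != 0;
}
```

In this model a complete disable wins over VM_HUGEPAGE, while the "except advised" bit only affects VMAs that carry neither advice flag.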


The documented semantics in the man page for PR_SET_THP_DISABLE
"is inherited by a child created via fork(2) and is preserved across
execve(2)" is maintained. This behavior, for example, allows for
disabling THPs for a workload through the launching process (e.g.,
systemd where we fork() a helper process to then exec()).

Yeah, this is something I REALLY don't want us to perpetuate, as it's
adding a new policy method by the 'back door'.

I had actually come to the conclusion that we absolutely should NOT
implement anything like this, as discussed in the THP cabal meeting.

HOWEVER, since this mechanism ALREADY EXISTS for this case, I am much more
ok with this.

We already perpetuate state for this across fork/exec.


There is currently no way to prevent a process from issuing
PR_SET_THP_DISABLE itself to re-enable THP. We could add a "seal" option
to PR_SET_THP_DISABLE through another flag if ever required. The known
users (such as redis) really use PR_SET_THP_DISABLE to disable THPs, so
that is not added for now.

Yeah not a fan of the seal idea, that will add a bunch of complexity and
state that I would rather not have.

I'd far prefer just disallowing re-enabling THP. We could allow
re-disabling with different flags though.

Be good to get some examples of the old + new prctl() invocations in the
commit message too.
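
As a rough sketch of what such examples might look like from userspace: the existing call passes 0 in arg3, the proposed one passes the new flag. The value (1UL << 1) for PR_THP_DISABLE_EXCEPT_ADVISED is an assumption inferred from (B) above (PR_GET_THP_DISABLE returning 3, with bit 0 reserved); the uapi hunk defining it is not quoted in this mail. The wrapper and enum names are hypothetical.

```c
#include <sys/prctl.h>

#ifndef PR_SET_THP_DISABLE
#define PR_SET_THP_DISABLE 41
#endif
#ifndef PR_GET_THP_DISABLE
#define PR_GET_THP_DISABLE 42
#endif
/* ASSUMPTION: bit 1, inferred from "return 3" with bit 0 reserved. */
#ifndef PR_THP_DISABLE_EXCEPT_ADVISED
#define PR_THP_DISABLE_EXCEPT_ADVISED (1UL << 1)
#endif

/* Old behavior: disable THPs completely for this process. */
int thp_disable_completely(void)
{
	return prctl(PR_SET_THP_DISABLE, 1UL, 0UL, 0UL, 0UL);
}

/*
 * Proposed behavior: disable THPs except in regions with VM_HUGEPAGE.
 * Kernels without the change reject a non-zero arg3 with EINVAL.
 */
int thp_disable_except_advised(void)
{
	return prctl(PR_SET_THP_DISABLE, 1UL, PR_THP_DISABLE_EXCEPT_ADVISED,
		     0UL, 0UL);
}

/* Re-enable THPs (arg2 == 0); flags do not apply here. */
int thp_reenable(void)
{
	return prctl(PR_SET_THP_DISABLE, 0UL, 0UL, 0UL, 0UL);
}

enum thp_disable_state {
	THP_NOT_DISABLED,
	THP_DISABLED_COMPLETELY,
	THP_DISABLED_EXCEPT_ADVISED,
	THP_STATE_UNKNOWN,
};

/* Interpret the return value of prctl(PR_GET_THP_DISABLE, 0, 0, 0, 0). */
enum thp_disable_state decode_thp_disable(int ret)
{
	if (ret < 0)
		return THP_STATE_UNKNOWN;
	switch ((unsigned long)ret) {
	case 0:
		return THP_NOT_DISABLED;
	case 1:
		return THP_DISABLED_COMPLETELY;
	case 1UL | PR_THP_DISABLE_EXCEPT_ADVISED:
		return THP_DISABLED_EXCEPT_ADVISED;
	default:
		return THP_STATE_UNKNOWN;
	}
}
```

A caller that wants the relaxed mode but must run on older kernels could try thp_disable_except_advised() first and decide, on EINVAL, whether falling back to thp_disable_completely() is acceptable for that workload.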


Cc: Jonathan Corbet <corbet@xxxxxxx>
Cc: Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx>
Cc: Lorenzo Stoakes <lorenzo.stoakes@xxxxxxxxxx>
Cc: Zi Yan <ziy@xxxxxxxxxx>
Cc: Baolin Wang <baolin.wang@xxxxxxxxxxxxxxxxx>
Cc: "Liam R. Howlett" <Liam.Howlett@xxxxxxxxxx>
Cc: Nico Pache <npache@xxxxxxxxxx>
Cc: Ryan Roberts <ryan.roberts@xxxxxxx>
Cc: Dev Jain <dev.jain@xxxxxxx>
Cc: Barry Song <baohua@xxxxxxxxxx>
Cc: Vlastimil Babka <vbabka@xxxxxxx>
Cc: Mike Rapoport <rppt@xxxxxxxxxx>
Cc: Suren Baghdasaryan <surenb@xxxxxxxxxx>
Cc: Michal Hocko <mhocko@xxxxxxxx>
Cc: Usama Arif <usamaarif642@xxxxxxxxx>
Cc: SeongJae Park <sj@xxxxxxxxxx>
Cc: Jann Horn <jannh@xxxxxxxxxx>
Cc: Liam R. Howlett <Liam.Howlett@xxxxxxxxxx>
Cc: Yafang Shao <laoar.shao@xxxxxxxxx>
Cc: Matthew Wilcox <willy@xxxxxxxxxxxxx>
Signed-off-by: David Hildenbrand <david@xxxxxxxxxx>

---

At first, I thought of "why not simply relax PR_SET_THP_DISABLE", but I
think there might be real use cases where we want to disable any THPs --
in particular also around debugging THP-related problems, and
"THP=never" not meaning ... "never" anymore. PR_SET_THP_DISABLE will

Well, not quite anymore :) it's been this way for a while right? Even since
MADV_COLLAPSE was introduced.

It goes back to 3.15, yes ...


also block MADV_COLLAPSE, which can be very helpful. Of course, I thought

Yes.

I mean I hate, hate, HATE this interface. But it exists and we have to
support it anyway.

of having a system-wide config to change PR_SET_THP_DISABLE behavior, but
I just don't like the semantics.

What do you mean?

Kconfig option to change the behavior etc. In summary, I don't want to go down that path, it all gets weird.



"prctl: allow overriding system THP policy to always"[1] proposed
"overriding policies to always", which is just the wrong way around: we
should not add mechanisms to "enable more" when we already have an
interface/mechanism to "disable" them (PR_SET_THP_DISABLE). It all gets
weird otherwise.

Yes. A 'disable but' is more logical.


"[PATCH 0/6] prctl: introduce PR_SET/GET_THP_POLICY"[2] proposed
setting the default of the VM_HUGEPAGE, which is similarly the wrong way
around I think now.

Yes.


The proposals by Lorenzo to extend process_madvise()[3] and mctrl()[4]
similarly were around the "default for VM_HUGEPAGE" idea, but after the
discussion, I think we should better leave VM_HUGEPAGE untouched.

Well, to be clear, these were more 'if we HAVE to do this, what is the
least awful way of doing this?' rather than proposals per se :P and mctrl()
was really taking existing discussed ideas and -simply seeing how it looked
in code- though in the end we decided better to spell out in words how it
would look.

At least now I'm not in favour of us implementing policy this way (but
again, am open to us extending an _existing_ abomination :)


Happy to hear naming suggestions for "PR_THP_DISABLE_EXCEPT_ADVISED" where
we essentially want to say "leave advised regions alone" -- "keep THP
enabled for advised regions",

Seems OK to me. I guess the one point of confusion could be people being
confused between the THP toggle 'madvise' vs. actually having MADV_HUGEPAGE
set, but that's moot, because 'madvise' mode only enables THP if the region
has had MADV_HUGEPAGE set.

Right, whatever ends up setting VM_HUGEPAGE.



The only thing I really dislike about this is using another MMF_* flag,
but well, no way around it -- and seems like we could easily support
more than 32 if we want to, or storing this thp information elsewhere.

Yes my exact thoughts. But I will be adding a series to change this for VMA
flags soon enough, and can potentially do mm flags at the same time...

So this shouldn't in the end be as much of a problem.

Maybe it's worth holding off on this until I've done that? But at any rate
I intend to do those changes next cycle, and this will be a next cycle
thing at the earliest anyway.

I don't think this series must be blocked by that. Using a bitmap instead of a single "unsigned long" should be fairly easy later -- I did not identify any big blockers.



I think this here (modifying an existing toggle) is the only prctl()
extension that we might be willing to accept. In general, I agree like
most others, that prctl() is a very bad interface for that -- but
PR_SET_THP_DISABLE is already there and is getting used.

Yes.


Long-term, I think the answer will be something based on bpf[5]. Maybe
in that context, there could still be value in easily disabling THPs for
selected workloads (esp. debugging purposes).

Jann raised valid concerns[6] about new flags that are persistent across
exec. As this here is a relaxation to existing PR_SET_THP_DISABLE I
consider it having a similar security risk as our existing
PR_SET_THP_DISABLE, but devil is in the detail.

Yes...


This is *completely* untested and might be utterly broken. It merely
serves as a PoC of what I think could be done. If this ever goes upstream,
we need some kselftests for it, and extensive tests.

Well :) I mean we should definitely try this out in anger and it _MUST_
have self tests and put under some pressure.

Usama, can you attack this and see?

Yes, that's what I am hoping for.



[1] https://lore.kernel.org/r/20250507141132.2773275-1-usamaarif642@xxxxxxxxx
[2] https://lkml.kernel.org/r/20250515133519.2779639-2-usamaarif642@xxxxxxxxx
[3] https://lore.kernel.org/r/cover.1747686021.git.lorenzo.stoakes@xxxxxxxxxx
[4] https://lkml.kernel.org/r/85778a76-7dc8-4ea8-8827-acb45f74ee05@lucifer.local
[5] https://lkml.kernel.org/r/20250608073516.22415-1-laoar.shao@xxxxxxxxx
[6] https://lore.kernel.org/r/CAG48ez3-7EnBVEjpdoW7z5K0hX41nLQN5Wb65Vg-1p8DdXRnjg@xxxxxxxxxxxxxx

---
Documentation/filesystems/proc.rst | 5 +--
fs/proc/array.c | 2 +-
include/linux/huge_mm.h | 20 ++++++++---
include/linux/mm_types.h | 13 +++----
include/uapi/linux/prctl.h | 7 ++++
kernel/sys.c | 58 +++++++++++++++++++++++-------
mm/khugepaged.c | 2 +-
7 files changed, 78 insertions(+), 29 deletions(-)

diff --git a/Documentation/filesystems/proc.rst b/Documentation/filesystems/proc.rst
index 2971551b72353..915a3e44bc120 100644
--- a/Documentation/filesystems/proc.rst
+++ b/Documentation/filesystems/proc.rst
@@ -291,8 +291,9 @@ It's slow but very precise.
HugetlbPages size of hugetlb memory portions
CoreDumping process's memory is currently being dumped
(killing the process may lead to a corrupted core)
- THP_enabled process is allowed to use THP (returns 0 when
- PR_SET_THP_DISABLE is set on the process
+ THP_enabled process is allowed to use THP (returns 0 when
+ PR_SET_THP_DISABLE is set on the process to disable
+ THP completely, not just partially)

Hmm but this means we have no way of knowing if it's set for partial

Yes. I briefly thought about indicating another member, but then I thought (a) it's ugly and (b) "who cares".

I also thought about just printing "partial" instead of "1", but not sure if that would break any parser.


Threads number of threads
SigQ number of signals queued/max. number for queue
SigPnd bitmap of pending signals for the thread
diff --git a/fs/proc/array.c b/fs/proc/array.c
index d6a0369caa931..c4f91a784104f 100644
--- a/fs/proc/array.c
+++ b/fs/proc/array.c
@@ -422,7 +422,7 @@ static inline void task_thp_status(struct seq_file *m, struct mm_struct *mm)
bool thp_enabled = IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE);

if (thp_enabled)
- thp_enabled = !test_bit(MMF_DISABLE_THP, &mm->flags);
+ thp_enabled = !test_bit(MMF_DISABLE_THP_COMPLETELY, &mm->flags);
seq_printf(m, "THP_enabled:\t%d\n", thp_enabled);
}

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index e0a27f80f390d..c4127104d9bc3 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -323,16 +323,26 @@ struct thpsize {
(transparent_hugepage_flags & \
(1<<TRANSPARENT_HUGEPAGE_USE_ZERO_PAGE_FLAG))

+/*
+ * Check whether THPs are explicitly disabled through madvise or prctl, or some
+ * architectures may disable THP for some mappings, for example, s390 kvm.
+ */
static inline bool vma_thp_disabled(struct vm_area_struct *vma,
vm_flags_t vm_flags)

This _should_ work as we set/clear VM_HUGEPAGE, VM_NOHUGEPAGE in madvise()
unconditionally without bothering to check available THP orders first so no
chicken-and-egg here.

{
+ /* Are THPs disabled for this VMA? */
+ if (vm_flags & VM_NOHUGEPAGE)
+ return true;
+ /* Are THPs disabled for all VMAs in the whole process? */
+ if (test_bit(MMF_DISABLE_THP_COMPLETELY, &vma->vm_mm->flags))
+ return true;
/*
- * Explicitly disabled through madvise or prctl, or some
- * architectures may disable THP for some mappings, for
- * example, s390 kvm.
+ * Are THPs disabled only for VMAs where we didn't get an explicit
+ * advise to use them?

Probably fine to drop the rather pernickety reference to s390 kvm here, I
mean we don't need to spell out massively specific details in a general
handler.

No strong opinion.


*/
- return (vm_flags & VM_NOHUGEPAGE) ||
- test_bit(MMF_DISABLE_THP, &vma->vm_mm->flags);
+ if (vm_flags & VM_HUGEPAGE)
+ return false;
+ return test_bit(MMF_DISABLE_THP_EXCEPT_ADVISED, &vma->vm_mm->flags);
}

static inline bool thp_disabled_by_hw(void)
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 1ec273b066915..a999f2d352648 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -1743,19 +1743,16 @@ enum {
#define MMF_VM_MERGEABLE 16 /* KSM may merge identical pages */
#define MMF_VM_HUGEPAGE 17 /* set when mm is available for khugepaged */

-/*
- * This one-shot flag is dropped due to necessity of changing exe once again
- * on NFS restore
- */
-//#define MMF_EXE_FILE_CHANGED 18 /* see prctl_set_mm_exe_file() */
+#define MMF_HUGE_ZERO_PAGE 18 /* mm has ever used the global huge zero page */

#define MMF_HAS_UPROBES 19 /* has uprobes */
#define MMF_RECALC_UPROBES 20 /* MMF_HAS_UPROBES can be wrong */
#define MMF_OOM_SKIP 21 /* mm is of no interest for the OOM killer */
#define MMF_UNSTABLE 22 /* mm is unstable for copy_from_user */
-#define MMF_HUGE_ZERO_PAGE 23 /* mm has ever used the global huge zero page */
-#define MMF_DISABLE_THP 24 /* disable THP for all VMAs */
-#define MMF_DISABLE_THP_MASK (1 << MMF_DISABLE_THP)
+#define MMF_DISABLE_THP_EXCEPT_ADVISED 23 /* no THP except for VMAs with VM_HUGEPAGE */
+#define MMF_DISABLE_THP_COMPLETELY 24 /* no THP for all VMAs */
+#define MMF_DISABLE_THP_MASK ((1 << MMF_DISABLE_THP_COMPLETELY) |\
+ (1 << MMF_DISABLE_THP_EXCEPT_ADVISED))

It feels a bit sigh to have to use up low-supply mm flags for this. But
again, I should be attacking this shortage soon enough.

Are we not going ahead with Barry's series that was going to use one of
these in the end?

Whoever gets acked first ;)


#define MMF_OOM_REAP_QUEUED 25 /* mm was queued for oom_reaper */
#define MMF_MULTIPROCESS 26 /* mm is shared between processes */
/*
diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
index 43dec6eed559a..1949bb9270d48 100644
--- a/include/uapi/linux/prctl.h
+++ b/include/uapi/linux/prctl.h
@@ -177,7 +177,14 @@ struct prctl_mm_map {

#define PR_GET_TID_ADDRESS 40

+/*
+ * Flags for PR_SET_THP_DISABLE are only applicable when disabling. Bit 0
+ * is reserved, so PR_GET_THP_DISABLE can return 1 when no other flags were
+ * specified for PR_SET_THP_DISABLE.
+ */

Probably worth specifying that you're just returning the flags here.

Yes.

Thanks!

--
Cheers,

David / dhildenb