Re: [PATCH POC] prctl: extend PR_SET_THP_DISABLE to optionally exclude VM_HUGEPAGE

From: David Hildenbrand
Date: Tue Jul 22 2025 - 06:23:23 EST


On 21.07.25 19:27, Usama Arif wrote:


On 21/07/2025 10:09, David Hildenbrand wrote:
People want to make use of more THPs, for example, moving from
THP=never to THP=madvise, or from THP=madvise to THP=always.

While this is great news for every THP desperately waiting to get
allocated out there, apparently there are some workloads that require a
bit of care during that transition: once problems are detected, these
workloads should be started with the old behavior, without making all
other workloads on the system go back to the old behavior as well.

In essence, the following scenarios are imaginable:

(1) Switch from THP=none to THP=madvise or THP=always, but keep the old
behavior (no THP) for selected workloads.

(2) Stay at THP=none, but have "madvise" or "always" behavior for
selected workloads.

(3) Switch from THP=madvise to THP=always, but keep the old behavior
(THP only when advised) for selected workloads.

(4) Stay at THP=madvise, but have "always" behavior for selected
workloads.

In essence, (2) can be emulated through (1), by setting THP!=none while
disabling THPs for all processes that don't want THPs. It requires
configuring all workloads, but that is a user-space problem to sort out.

(4) can be emulated through (3) in a similar way.

Back when (1) was relevant in the past, as people started enabling THPs,
we added PR_SET_THP_DISABLE, so relevant workloads that were not ready
yet (e.g., Redis) were able to just disable THPs completely. Redis
still implements the option to use this interface to disable THPs
completely.

With PR_SET_THP_DISABLE, we added a way to force-disable THPs for a
workload -- a process, including fork+exec'ed process hierarchy.
That essentially made us support (1): simply disable THPs for all workloads
that are not ready for THPs yet, while still enabling THPs system-wide.
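For reference, a minimal sketch of how a workload can use the existing interface on itself. It assumes a Linux system whose <sys/prctl.h> exposes PR_SET_THP_DISABLE and PR_GET_THP_DISABLE; the helper names are hypothetical and not taken from any project.

```c
#include <errno.h>
#include <sys/prctl.h>

/*
 * Disable THPs completely for the calling process (existing behavior).
 * Returns 0 on success or a negative errno value on failure.
 * thp_disable_self() is a hypothetical helper name.
 */
static int thp_disable_self(void)
{
	/* arg2 = 1 disables; arg3..arg5 are 0 with the existing interface. */
	if (prctl(PR_SET_THP_DISABLE, 1, 0, 0, 0))
		return -errno;
	return 0;
}

/*
 * Query the current setting: a non-zero return means THPs are disabled
 * for the caller, 0 means they are not, negative errno on failure.
 */
static int thp_disabled_self(void)
{
	int ret = prctl(PR_GET_THP_DISABLE, 0, 0, 0, 0);

	return ret < 0 ? -errno : ret;
}
```

A process would typically call thp_disable_self() early during startup, before it allocates the memory it cares about.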

The quest for handling (3) and (4) started, but current approaches
(completely new prctl, options to set other policies per process,
alternatives to prctl -- mctrl, cgroup handling) don't look particularly
promising. Likely, the future will use bpf or something similar to
implement better policies, in particular to also make better decisions
about THP sizes to use, but this will certainly take a while as that work
just started.

Long story short: a simple enable/disable is not really suitable for the
future, so we're not willing to add completely new toggles.

While we could emulate (3)+(4) through (1)+(2) by simply disabling THPs
completely for these processes, this scares many THPs in our system
because they could no longer get allocated where they used to be
allocated: in regions flagged as VM_HUGEPAGE. Apparently, that poses a
problem for relevant workloads, because "no THPs" is certainly worse
than "THPs only when advised".

Could we simply relax PR_SET_THP_DISABLE, to "disable THPs unless
explicitly advised by the app through MADV_HUGEPAGE"? *maybe*, but this
would change the documented semantics quite a bit, and reduce the
versatility of using it for debugging purposes, so I am not 100% sure
that is what we want -- although it would certainly be much easier.

So instead, as an easy way forward for (3) and (4), add an option to
make PR_SET_THP_DISABLE disable *fewer* THPs for a process.
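The intended effect can be sketched as a small decision model. This is not kernel code: the enum and function names are hypothetical, 'advised' stands in for VM_HUGEPAGE on the VMA, and details such as VM_NOHUGEPAGE or per-size policies are ignored. It only illustrates that the per-process setting can take THPs away relative to the system policy, never add them.

```c
#include <stdbool.h>

/* System-wide policy (never / madvise / always). */
enum thp_system_policy { THP_NEVER, THP_MADVISE, THP_ALWAYS };

/* Per-process restriction; hypothetical names, for illustration only. */
enum thp_process_mode {
	PROC_THP_DEFAULT,		/* no prctl restriction */
	PROC_THP_EXCEPT_ADVISED,	/* disabled unless the VMA was advised */
	PROC_THP_DISABLED,		/* disabled completely */
};

/* May a THP be used for a VMA? 'advised' models VM_HUGEPAGE on the VMA. */
static bool thp_allowed(enum thp_system_policy sys,
			enum thp_process_mode mode, bool advised)
{
	/* The per-process setting can only restrict ... */
	if (mode == PROC_THP_DISABLED)
		return false;
	if (mode == PROC_THP_EXCEPT_ADVISED && !advised)
		return false;

	/* ... and the system policy still has the final say. */
	switch (sys) {
	case THP_ALWAYS:
		return true;
	case THP_MADVISE:
		return advised;
	default:
		return false;
	}
}
```

With the system at "always", a process in the "except advised" mode behaves as if the system were at "madvise", which is scenario (3).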

In essence, this patch:

(A) Adds PR_THP_DISABLE_EXCEPT_ADVISED, to be used as a flag in arg3
of prctl(PR_SET_THP_DISABLE) when disabling THPs (arg2 != 0).

Previously, arg3 had to be 0 (-EINVAL otherwise). Now it holds
flags.

(B) Makes prctl(PR_GET_THP_DISABLE) return 3 if
PR_THP_DISABLE_EXCEPT_ADVISED was set while disabling.

Previously, it would return 1 if THPs were disabled completely. Now
it essentially returns the set flags as well.

(C) Renames MMF_DISABLE_THP to MMF_DISABLE_THP_COMPLETELY, to express
the semantics clearly.

Fortunately, there are only two instances outside of prctl() code.

(D) Adds MMF_DISABLE_THP_EXCEPT_ADVISED to express "no THP except for VMAs
with VM_HUGEPAGE" -- essentially "thp=madvise" behavior

Fortunately, we only have to extend vma_thp_disabled().

(E) Indicates "THP_enabled: 0" in /proc/pid/status only if THPs are
disabled completely

That is, THPs are only indicated as disabled when they are really
disabled completely, not only partially.
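Points (A) to (E) can be summarized in a small userspace model. This is a sketch, not the patch itself: the struct and function names are hypothetical, the two booleans stand in for the two MMF_* flags, and the flag value (bit 1) is an assumption inferred from PR_GET_THP_DISABLE returning 3.

```c
#include <errno.h>
#include <stdbool.h>

/* Assumption: bit 1, inferred from PR_GET_THP_DISABLE returning 3 (1 | 2). */
#define SKETCH_THP_DISABLE_EXCEPT_ADVISED	(1UL << 1)

/* Stand-in for the two per-mm flags from (C) and (D). */
struct mm_model {
	bool disable_thp_completely;
	bool disable_thp_except_advised;
};

/* (A): arg2 selects disable/enable, arg3 carries flags when disabling. */
static int model_set_thp_disable(struct mm_model *mm,
				 unsigned long thp_disable,
				 unsigned long flags)
{
	if (flags & ~SKETCH_THP_DISABLE_EXCEPT_ADVISED)
		return -EINVAL;
	/* Flags are only allowed when disabling. */
	if (!thp_disable && flags)
		return -EINVAL;

	mm->disable_thp_completely = thp_disable && !flags;
	mm->disable_thp_except_advised = thp_disable && flags;
	return 0;
}

/* (B): 0 = not disabled, 1 = disabled completely, 3 = except advised. */
static int model_get_thp_disable(const struct mm_model *mm)
{
	if (mm->disable_thp_completely)
		return 1;
	if (mm->disable_thp_except_advised)
		return (int)(1 | SKETCH_THP_DISABLE_EXCEPT_ADVISED);
	return 0;
}

/* (E): value reported as THP_enabled in /proc/pid/status. */
static int model_status_thp_enabled(const struct mm_model *mm)
{
	return !mm->disable_thp_completely;
}
```

Note that in this model the two states are mutually exclusive: a later call replaces the earlier setting instead of combining with it.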

The documented semantics in the man page for PR_SET_THP_DISABLE
"is inherited by a child created via fork(2) and is preserved across
execve(2)" is maintained. This behavior, for example, allows for
disabling THPs for a workload through the launching process (e.g.,
systemd where we fork() a helper process to then exec()).
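A minimal sketch of such a launcher, assuming only the existing prctl(PR_SET_THP_DISABLE) semantics (inherited across fork(), preserved across execve()). The function name and the exit codes 126/127 are illustrative choices, not part of any existing tool.

```c
#include <sys/prctl.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Run argv[0] with THPs disabled for the new program.
 * Returns the child's exit status, or -1 on error.
 * run_with_thp_disabled() is a hypothetical helper name.
 */
static int run_with_thp_disabled(char *const argv[])
{
	int status;
	pid_t pid = fork();

	if (pid < 0)
		return -1;

	if (pid == 0) {
		/* Set in the helper process: survives the exec below. */
		if (prctl(PR_SET_THP_DISABLE, 1, 0, 0, 0))
			_exit(126);
		execvp(argv[0], argv);
		_exit(127);	/* only reached if exec failed */
	}

	if (waitpid(pid, &status, 0) < 0)
		return -1;
	return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

The launcher itself is unaffected; only the forked helper and whatever it executes carry the setting.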

There is currently no way to prevent a process from issuing
PR_SET_THP_DISABLE itself to re-enable THP. We could add a "seal" option
to PR_SET_THP_DISABLE through another flag if ever required. The known
users (such as redis) really use PR_SET_THP_DISABLE to disable THPs, so
that is not added for now.

Cc: Jonathan Corbet <corbet@xxxxxxx>
Cc: Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx>
Cc: Lorenzo Stoakes <lorenzo.stoakes@xxxxxxxxxx>
Cc: Zi Yan <ziy@xxxxxxxxxx>
Cc: Baolin Wang <baolin.wang@xxxxxxxxxxxxxxxxx>
Cc: "Liam R. Howlett" <Liam.Howlett@xxxxxxxxxx>
Cc: Nico Pache <npache@xxxxxxxxxx>
Cc: Ryan Roberts <ryan.roberts@xxxxxxx>
Cc: Dev Jain <dev.jain@xxxxxxx>
Cc: Barry Song <baohua@xxxxxxxxxx>
Cc: Vlastimil Babka <vbabka@xxxxxxx>
Cc: Mike Rapoport <rppt@xxxxxxxxxx>
Cc: Suren Baghdasaryan <surenb@xxxxxxxxxx>
Cc: Michal Hocko <mhocko@xxxxxxxx>
Cc: Usama Arif <usamaarif642@xxxxxxxxx>
Cc: SeongJae Park <sj@xxxxxxxxxx>
Cc: Jann Horn <jannh@xxxxxxxxxx>
Cc: Liam R. Howlett <Liam.Howlett@xxxxxxxxxx>
Cc: Yafang Shao <laoar.shao@xxxxxxxxx>
Cc: Matthew Wilcox <willy@xxxxxxxxxxxxx>
Signed-off-by: David Hildenbrand <david@xxxxxxxxxx>

---

At first, I thought of "why not simply relax PR_SET_THP_DISABLE", but I
think there might be real use cases where we want to disable any THPs --
in particular also around debugging THP-related problems, and
"THP=never" not meaning ... "never" anymore. PR_SET_THP_DISABLE will
also block MADV_COLLAPSE, which can be very helpful. Of course, I thought
of having a system-wide config to change PR_SET_THP_DISABLE behavior, but
I just don't like the semantics.

"prctl: allow overriding system THP policy to always"[1] proposed
"overriding policies to always", which is just the wrong way around: we
should not add mechanisms to "enable more" when we already have an
interface/mechanism to "disable" them (PR_SET_THP_DISABLE). It all gets
weird otherwise.

"[PATCH 0/6] prctl: introduce PR_SET/GET_THP_POLICY"[2] proposed
setting the default of the VM_HUGEPAGE, which is similarly the wrong way
around I think now.

The proposals by Lorenzo to extend process_madvise()[3] and mctrl()[4]
similarly were around the "default for VM_HUGEPAGE" idea, but after the
discussion, I think we should better leave VM_HUGEPAGE untouched.

Happy to hear naming suggestions for "PR_THP_DISABLE_EXCEPT_ADVISED" where
we essentially want to say "leave advised regions alone" -- "keep THP
enabled for advised regions".

The only thing I really dislike about this is using another MMF_* flag,
but well, no way around it -- and seems like we could easily support
more than 32 if we want to, or storing this thp information elsewhere.

I think this here (modifying an existing toggle) is the only prctl()
extension that we might be willing to accept. In general, I agree, like
most others, that prctl() is a very bad interface for that -- but
PR_SET_THP_DISABLE is already there and is getting used.

Long-term, I think the answer will be something based on bpf[5]. Maybe
in that context, there could still be value in easily disabling THPs for
selected workloads (esp. for debugging purposes).

Jann raised valid concerns[6] about new flags that are persistent across
exec. As this is a relaxation of the existing PR_SET_THP_DISABLE, I
consider it to have a similar security risk as our existing
PR_SET_THP_DISABLE, but the devil is in the details.

This is *completely* untested and might be utterly broken. It merely
serves as a PoC of what I think could be done. If this ever goes upstream,
we need some kselftests for it, and extensive tests.

[1] https://lore.kernel.org/r/20250507141132.2773275-1-usamaarif642@xxxxxxxxx
[2] https://lkml.kernel.org/r/20250515133519.2779639-2-usamaarif642@xxxxxxxxx
[3] https://lore.kernel.org/r/cover.1747686021.git.lorenzo.stoakes@xxxxxxxxxx
[4] https://lkml.kernel.org/r/85778a76-7dc8-4ea8-8827-acb45f74ee05@lucifer.local
[5] https://lkml.kernel.org/r/20250608073516.22415-1-laoar.shao@xxxxxxxxx
[6] https://lore.kernel.org/r/CAG48ez3-7EnBVEjpdoW7z5K0hX41nLQN5Wb65Vg-1p8DdXRnjg@xxxxxxxxxxxxxx

---
 Documentation/filesystems/proc.rst |  5 +--
 fs/proc/array.c                    |  2 +-
 include/linux/huge_mm.h            | 20 ++++++++---
 include/linux/mm_types.h           | 13 +++----
 include/uapi/linux/prctl.h         |  7 ++++
 kernel/sys.c                       | 58 +++++++++++++++++++++++-------
 mm/khugepaged.c                    |  2 +-
 7 files changed, 78 insertions(+), 29 deletions(-)


Thanks for the patch David!

As discussed in the other thread, with the below diff

diff --git a/kernel/sys.c b/kernel/sys.c
index 2a34b2f70890..3912f5b6a02d 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -2447,7 +2447,7 @@ static int prctl_set_thp_disable(unsigned long thp_disable, unsigned long flags,
 		return -EINVAL;
 	/* Flags are only allowed when disabling. */
-	if (!thp_disable || (flags & ~PR_THP_DISABLE_EXCEPT_ADVISED))
+	if ((!thp_disable && flags) || (flags & ~PR_THP_DISABLE_EXCEPT_ADVISED))
 		return -EINVAL;
 	if (mmap_write_lock_killable(current->mm))
 		return -EINTR;
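To make the effect of the one-line change explicit, the two conditions can be compared in isolation. This is a standalone sketch: the function names are hypothetical, and the flag value (bit 1) is an assumption inferred from PR_GET_THP_DISABLE returning 3.

```c
#include <errno.h>

/* Assumption: bit 1, inferred from PR_GET_THP_DISABLE returning 3. */
#define SKETCH_THP_DISABLE_EXCEPT_ADVISED	(1UL << 1)

/* Condition before the fix: every call with thp_disable == 0 is rejected. */
static int check_before(unsigned long thp_disable, unsigned long flags)
{
	if (!thp_disable || (flags & ~SKETCH_THP_DISABLE_EXCEPT_ADVISED))
		return -EINVAL;
	return 0;
}

/* Condition after the fix: only "re-enable with flags" is rejected. */
static int check_after(unsigned long thp_disable, unsigned long flags)
{
	if ((!thp_disable && flags) ||
	    (flags & ~SKETCH_THP_DISABLE_EXCEPT_ADVISED))
		return -EINVAL;
	return 0;
}
```

The relevant case is (thp_disable = 0, flags = 0), i.e. re-enabling THPs: the old condition returns -EINVAL, the new one lets it through. All other combinations behave the same.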


I tested with the below selftest, and it works. It hopefully covers
the majority of the cases, including fork and re-enabling THPs.
Let me know if it looks ok and please feel free to add this in the
next revision you send.


Once the above diff is included, please feel free to add

Acked-by: Usama Arif <usamaarif642@xxxxxxxxx>
Tested-by: Usama Arif <usamaarif642@xxxxxxxxx>

Thanks!

The latest version lives at

https://github.com/davidhildenbrand/linux/tree/PR_SET_THP_DISABLE

With all current review feedback addressed (primarily around description+comments) + that one fix.




Thanks!

From ee9004e7d34511a79726ee1314aec0503e6351d4 Mon Sep 17 00:00:00 2001
From: Usama Arif <usamaarif642@xxxxxxxxx>
Date: Thu, 15 May 2025 14:33:33 +0100
Subject: [PATCH] selftests: prctl: introduce tests for
PR_THP_DISABLE_EXCEPT_ADVISED

The test is limited to 2M PMD THPs. It does not modify the system
settings in order to not disturb other processes running in the system.
It checks if the PMD size is 2M, if the 2M policy is set to inherit
and if the system global THP policy is set to "always", so that
the change in behaviour due to PR_THP_DISABLE_EXCEPT_ADVISED can
be seen.

This tests if:
- the process can successfully set the policy
- carry it over to the new process with fork
- if no hugepage is gotten when the process doesn't MADV_HUGEPAGE
- if hugepage is gotten when the process does MADV_HUGEPAGE
- the process can successfully reset the policy to PR_THP_POLICY_SYSTEM
- if hugepage is gotten after the policy reset

Signed-off-by: Usama Arif <usamaarif642@xxxxxxxxx>
---
 tools/testing/selftests/prctl/Makefile      |   2 +-
 tools/testing/selftests/prctl/thp_disable.c | 207 ++++++++++++++++++++

Like SJ says, this should rather live under mm, then we can also make use of check_huge_anon() and vm_util.c and probably also THP helpers from thp_settings.h. Most of the helpers you use should be available in some form there already.
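For readers unfamiliar with what such a check involves, here is a minimal standalone sketch of testing whether a freshly touched region ended up THP-backed by reading AnonHugePages from /proc/self/smaps. It is not the selftest code and not the mm selftest helper: the function names are hypothetical, and the 2 MiB PMD size is an assumption matching the test description.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

#define SKETCH_PMD_SIZE	(2UL << 20)	/* assumption: 2 MiB PMD THPs */

/* 1 if the VMA containing addr reports AnonHugePages > 0, 0 if not, -1 on error. */
static int region_has_thp(const void *addr)
{
	unsigned long start, end, kb;
	int in_vma = 0, ret = 0;
	char line[256];
	FILE *f = fopen("/proc/self/smaps", "r");

	if (!f)
		return -1;
	while (fgets(line, sizeof(line), f)) {
		/* A "start-end ..." header line begins a new VMA entry. */
		if (sscanf(line, "%lx-%lx ", &start, &end) == 2)
			in_vma = (uintptr_t)addr >= start && (uintptr_t)addr < end;
		else if (in_vma &&
			 sscanf(line, "AnonHugePages: %lu kB", &kb) == 1 && kb > 0)
			ret = 1;
	}
	fclose(f);
	return ret;
}

/* Map, optionally advise, touch, and check one PMD-aligned range. */
static int touch_and_check(int advise)
{
	size_t len = 2 * SKETCH_PMD_SIZE;
	char *raw = mmap(NULL, len, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	char *aligned;
	int ret;

	if (raw == MAP_FAILED)
		return -1;
	/* Round up to the next PMD boundary inside the mapping. */
	aligned = (char *)(((uintptr_t)raw + SKETCH_PMD_SIZE - 1) &
			   ~(SKETCH_PMD_SIZE - 1));
	if (advise && madvise(aligned, SKETCH_PMD_SIZE, MADV_HUGEPAGE)) {
		munmap(raw, len);
		return -1;
	}
	memset(aligned, 1, SKETCH_PMD_SIZE);	/* fault the range in */
	ret = region_has_thp(aligned);
	munmap(raw, len);
	return ret;
}
```

Because smaps reports per VMA, the unadvised case may also count huge pages from neighboring memory that was merged into the same VMA; a real test would want a stricter check.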

With THP helpers in thp_settings.h, you can explicitly set the system policy, to then reset to the previous value afterwards, IIRC.

Further, can you make sure to use kselftest infrastructure for the test, preferably kselftest_harness.h? (see pfnmap.c, one of my latest selftests)

I also wonder if we want to test old behavior, without the flag set. We could also test that MADV_COLLAPSE doesn't succeed in either case.

Ideally, you'd send my patch (see above) along with the selftest, as I suspect there will be more review+changes to the selftest (and only smaller changes to my patch).

Thanks!

--
Cheers,

David / dhildenb