Re: [PATCH 0/1] prctl: allow overriding system THP policy to always

From: David Hildenbrand
Date: Fri May 09 2025 - 05:31:04 EST


On 09.05.25 11:24, Yafang Shao wrote:
On Fri, May 9, 2025 at 1:13 PM Johannes Weiner <hannes@xxxxxxxxxxx> wrote:

On Fri, May 09, 2025 at 10:15:08AM +0800, Yafang Shao wrote:
On Fri, May 9, 2025 at 12:04 AM Usama Arif <usamaarif642@xxxxxxxxx> wrote:



On 08/05/2025 06:41, Yafang Shao wrote:
On Thu, May 8, 2025 at 12:09 AM Usama Arif <usamaarif642@xxxxxxxxx> wrote:



On 07/05/2025 16:57, Zi Yan wrote:
On 7 May 2025, at 11:12, Usama Arif wrote:

On 07/05/2025 15:57, Zi Yan wrote:
+Yafang, who is also looking at changing THP config at cgroup/container level.

Thanks


On 7 May 2025, at 10:00, Usama Arif wrote:

Allowing override of global THP policy per process allows workloads
that have shown to benefit from hugepages to do so, without regressing
workloads that wouldn't benefit. This will allow such types of
workloads to be run/stacked on the same machine.

It also helps in rolling out hugepages in hyperscaler configurations
for workloads that benefit from them, where a single THP policy is
likely to be used across the entire fleet, and prctl will help override it.

An advantage of doing it via prctl vs creating a cgroup specific
option (like /sys/fs/cgroup/test/memory.transparent_hugepage.enabled) is
that this will work even when there are no cgroups present, and my
understanding is there is a strong preference of cgroups controls being
hierarchical which usually means them having a numerical value.

Hi Usama,

Do you mind giving an example on how to change THP policy for a set of
processes running in a container (under a cgroup)?

Hi Zi,

In our case, we create the processes in the cgroup via systemd. The way we will enable THP=always
for processes in a cgroup is in the same way we enable KSM for the cgroup.
The change in systemd would be very similar to the line in [1], where we would set prctl PR_SET_THP_ALWAYS
in exec-invoke.
This is at the start of the process, but you would already know at the start of the process
whether you want THP=always for it or not.

[1] https://github.com/systemd/systemd/blob/2e72d3efafa88c1cb4d9b28dd4ade7c6ab7be29a/src/core/exec-invoke.c#L5045

You also need to add a new systemd.directives, e.g., MemoryTHP, to
pass the THP enablement or disablement info from a systemd config file.
And if you find those processes do not benefit from using THPs,
you can just change the new "MemoryTHP" config and restart the processes.

Am I getting it? Thanks.


Yes, thats right. They would exactly the same as what we (Meta) do
for KSM. So have MemoryTHP similar to MemroryKSM [1] and if MemoryTHP is set,
the ExecContext->memory_thp would be set similar to memory_ksm [2], and when
that is set, the prctl will be called at exec_invoke of the process [3].

The systemd changes should be quite simple to do.

[1] https://github.com/systemd/systemd/blob/2e72d3efafa88c1cb4d9b28dd4ade7c6ab7be29a/man/systemd.exec.xml#L1978
[2] https://github.com/systemd/systemd/blob/2e72d3efafa88c1cb4d9b28dd4ade7c6ab7be29a/src/core/dbus-execute.c#L2151
[3] https://github.com/systemd/systemd/blob/2e72d3efafa88c1cb4d9b28dd4ade7c6ab7be29a/src/core/exec-invoke.c#L5045

This solution carries a risk: since prctl() does not require any
capabilities, the task itself could call it and override your memory
policy. While we could enforce CAP_SYS_RESOURCE to restrict this, that
capability is typically enabled by default in containers, leaving them
still vulnerable.

This approach might work for Kubernetes/container environments, but it
would require substantial code changes to implement securely.


You can already change the memory policy with prctl, for e.g. PR_SET_THP_DISABLE
already exists and the someone could use this to slow the process down. So the
approach this patch takes shouldn't be anymore of a security fix then what is already
exposed by the kernel. I think as you mentioned, if prctl is an issue CAP_SYS_RESOURCE
should be used to restrict this.

I believe we should at least require CAP_SYS_RESOURCE to enable THP,
since it overrides global system settings. Alternatively,
CAP_SYS_ADMIN might be even more appropriate, though I'm not entirely
certain.

Hm, could you verbalize a concrete security concern?

I've never really looked at the global settings as a hard policy, more
as picking a default for the workloads in the system. It's usually
`madvise' or `always', and MADV_HUGEPAGE and MADV_NOHUGEPAGE have long
existed to give applications the ability to refine the global choice.

The prctl should probably respect `never' for consistency, but beyond
that I don't really see the concern, or how this would allow something
that isn't already possible.

I would interpret the always, madvise, and never options as follows:
- always
The sysadmin strongly recommends using THP. If a user does not
want to use it, they must explicitly disable it.
- madvise
The sysadmin gently encourages the use of THP, but it is only
enabled when explicitly requested by the application.
- never
The sysadmin discourages the use of THP, and "its use is only permitted
with explicit approval" .

"never" so far means "no thps, no exceptions". We've had serious THP issues in the past, where our workaround until we sorted out the issue for affected customers was to force-disable THPs on that system during boot.

--
Cheers,

David / dhildenb