Re: [RFC PATCH 0/5] add process_madvise() flags to modify behaviour

From: David Hildenbrand
Date: Mon May 26 2025 - 08:59:48 EST



To summarize my current view:

1) ebpf: most people are not a fan of that, and I agree, at least
for this purpose. If we were talking about making better *placement*
decisions using ebpf, it would be a different story.

By placement decisions, do you mean placement between memory
tiers/nodes or something else?

More like: which size to place, but it could be extended to other policies, maybe.

Assume we have a page fault and have to decide which size to place.

For a process that we really want to use THPs (VM_HUGEPAGE?), we could use the largest free folio possible.

For a process that we don't want to spend valuable THPs on (VM_HUGEPAGE not set?), we could use the smallest free folio possible.

Such a policy might be encoded in an ebpf program, I assume.

The hints (prioritize regions/processes, deprioritize regions/processes), such as VM_HUGEPAGE, would be inputs into such a program.
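
To make the size-placement idea a bit more concrete, here is a rough userspace sketch, not kernel code. Everything in it is an assumption for illustration: SKETCH_VM_HUGEPAGE merely stands in for the real VM_HUGEPAGE hint, the bitmask of free folio orders is an invented input, and sketch_pick_order() is a hypothetical name. It only models "largest free folio if hinted, smallest otherwise".

```c
#include <stdint.h>

/* Hypothetical stand-in for the VM_HUGEPAGE hint; the value is illustrative only. */
#define SKETCH_VM_HUGEPAGE (1u << 0)

/*
 * free_orders: bit N set means a free folio of order N is available
 * (an invented input for this sketch).
 * Returns the chosen folio order, or -1 if nothing is free.
 */
static int sketch_pick_order(uint32_t free_orders, uint32_t vma_flags)
{
	int order;

	if (free_orders == 0)
		return -1;

	if (vma_flags & SKETCH_VM_HUGEPAGE) {
		/* Prioritized: take the largest free folio. */
		for (order = 31; order >= 0; order--)
			if (free_orders & (1u << order))
				return order;
	}

	/* Not prioritized: take the smallest free folio. */
	for (order = 0; order < 32; order++)
		if (free_orders & (1u << order))
			return order;

	return -1;
}
```

An ebpf-based variant would presumably take the same kind of inputs (the hints plus what is currently free) and return the order; the sketch says nothing about how such a hook would actually be wired up.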


2) prctl(): the unloved child, and I can understand why. Maybe now is
the right time to stop adding new MM things that feel weird in there.
Maybe we should already have done that with the KSM toggle (guess who
was involved in that ;) ).

At the moment systemd is the user I know of and I think it would be very
easy to migrate it to whatever new thing we decide here.

Agreed.



3) process_madvise(): I think it's an interesting extension, but
probably we should just have something that applies to the whole
address space naturally. At least my take for now.

4) new syscall: worth exploring how it would look. I'm especially
interested in flag options (e.g., SET_DEFAULT_EXEC) and how we could
make them only apply to selected controls.

Was there any previous discussion on SET_DEFAULT_EXEC? This is the first
time I am hearing about it.

I think it evolved in the discussion here from PMADV_SET_FORK_EXEC_DEFAULT.


Overall I agree with your assessment and thus I was requesting to at
least discuss the new syscall option as well.

Yes.

I am still not sure if having a new "process" [1] mode would be a reasonable alternative to setting the VM_HUGEPAGE/VM_NOHUGEPAGE default. Assuming we would have a "process" mode, we could (a) set the policy per-process using the new syscall we discuss here, (b) have an option to set the policy to use for the exec child, and (c) maybe have an option to seal the policy (depending on who is allowed to set the policy in the first place).

On the + side, we don't lose hints/instructions from the app (VM_HUGEPAGE/VM_NOHUGEPAGE) when changing the policy on an already running process.

The problem I see with the "process" policy is that people might want different "default" policies for processes, which means that we will have to add yet another toggle.
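
For illustration only, the (a)/(b)/(c) pieces of such a "process" mode could be modelled roughly like this in plain C. None of these names exist anywhere; the enum values, the struct, and both helpers are hypothetical, and whether a seal would carry over exec is purely an assumption of the sketch.

```c
#include <stdbool.h>

/* Hypothetical per-process THP policies; not an existing interface. */
enum sketch_thp_policy {
	SKETCH_THP_DEFAULT,
	SKETCH_THP_PREFER,
	SKETCH_THP_AVOID,
};

struct sketch_proc_policy {
	enum sketch_thp_policy cur;        /* (a) policy of this process */
	enum sketch_thp_policy exec_child; /* (b) policy applied after exec */
	bool sealed;                       /* (c) no further changes allowed */
};

/* Returns 0 on success, -1 if the policy was sealed earlier. */
static int sketch_set_policy(struct sketch_proc_policy *p,
			     enum sketch_thp_policy cur,
			     enum sketch_thp_policy exec_child,
			     bool seal)
{
	if (p->sealed)
		return -1;

	p->cur = cur;
	p->exec_child = exec_child;
	p->sealed = seal;
	return 0;
}

/* Assumption: the exec child starts with exec_child and inherits the seal. */
static struct sketch_proc_policy
sketch_on_exec(const struct sketch_proc_policy *parent)
{
	struct sketch_proc_policy child = {
		.cur = parent->exec_child,
		.exec_child = parent->exec_child,
		.sealed = parent->sealed,
	};

	return child;
}
```

Note that the VM_HUGEPAGE/VM_NOHUGEPAGE hints do not appear in this model at all, which is the point of the + side above: the per-process policy sits next to them instead of overwriting them.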


How I hate THP toggles. :)

[1] https://lore.kernel.org/all/CALOAHbB-KQ4+z-Lupv7RcxArfjX7qtWcrboMDdT4LdpoTXOMyw@xxxxxxxxxxxxxx/

--
Cheers,

David / dhildenb