Re: [RFC] proc: Add a new isolated /proc/pid/mempolicy type.

From: Abel Wu
Date: Tue Sep 27 2022 - 23:10:05 EST


On 9/27/22 9:58 PM, Michal Hocko wrote:
On Tue 27-09-22 21:07:02, Abel Wu wrote:
On 9/27/22 6:49 PM, Michal Hocko wrote:
On Tue 27-09-22 11:20:54, Abel Wu wrote:
[...]
Btw, in order to add a per-thread-group mempolicy, is it possible to add
a mempolicy to mm_struct?

I dunno. This would make the mempolicy interface even more confusing.
Per-mm behavior makes a lot of sense, but we already have per-thread
semantics, so I would stick to them rather than introducing new ones.

Why is this really important?

We want soft control over the memory footprint of background jobs by
applying NUMA preferences when necessary, so the impact on different
NUMA nodes can be managed to some extent. These NUMA preferences are
given by the control panel, and it might not be suitable for them to
overwrite tasks that already have specific memory policies (or vice
versa).

Maybe the answer is somehow implicit, but I do not really see any
argument for the per-thread-group semantics here. In other words, why
does a new interface have to cover more than the local
[sg]et_mempolicy? I can see convenience as one potential argument. Also,
if there is a requirement to change the policy in an atomic way, then
this would require a single syscall.

Convenience is not our major concern. A well-tuned workload can have
specific memory policies for different tasks/vmas in one process, and
this can be achieved by set_mempolicy()/mbind() respectively. Other
workloads are not tuned this way: they don't care where their memory
resides, so the impact they have on co-located workloads might vary
across NUMA nodes.

The control panel, which has full knowledge of workload profiling,
may want to interfere with the behavior of the non-mempolicied processes
by giving them NUMA preferences, to better serve the co-located jobs.

So in this scenario, a process's memory policy can be assigned
dynamically by two parties:

a) the process itself, through set_mempolicy()/mbind()
b) the control panel, for which no API is available right now

Considering that the two policies should not fight each other, it sounds
reasonable to introduce a new syscall to assign a memory policy to a
process through struct mm_struct.

So you want to allow restoring the original local policy if the external
one is disabled?

Pretty much, but the internal policies are expected to have precedence
over the external ones, since they are set for some reason to meet their
specific requirements. The external ones are used only when there is no
internal policy active.
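
The precedence described above can be sketched in C roughly as follows.
This is only an illustration, not kernel code: struct policy, the
effective_policy() helper and the "external" slot (standing in for a
policy kept in mm_struct) are hypothetical names. Putting the vma policy
ahead of the task policy mirrors how mbind() ranges override
set_mempolicy() today; placing the external policy after both is the
assumption being proposed here.

```c
#include <stddef.h>

/* Hypothetical, simplified stand-in for a memory policy. */
enum policy_mode { POL_DEFAULT, POL_PREFERRED, POL_BIND, POL_INTERLEAVE };

struct policy {
    enum policy_mode mode;
    unsigned long nodemask;   /* bit n set => NUMA node n is selected */
};

/* Fallback used when nothing is set (assumption: default local allocation). */
static const struct policy system_default = { POL_DEFAULT, 0 };

/*
 * Pick the policy that governs an allocation. NULL means "not set".
 * Internal policies (vma first, then task) always win; the external
 * per-mm policy is consulted only when both internal ones are absent.
 */
static const struct policy *effective_policy(const struct policy *vma,
                                             const struct policy *task,
                                             const struct policy *external)
{
    if (vma)
        return vma;
    if (task)
        return task;
    if (external)
        return external;
    return &system_default;
}
```

With a task policy set by the process and an external one set by the
control panel, effective_policy(NULL, &task, &ext) returns the task
policy; once the process has no policy of its own,
effective_policy(NULL, NULL, &ext) falls back to the external one.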


Anyway, pidfd_$FOO behavior should be semantically very similar to the
original $FOO. Moving from per-task to per-mm is a major shift in
semantics. I can imagine having a dedicated flag for the syscall to
enforce the policy on the full thread group. But having different
semantics is both tricky and also constraining, because per-thread
binding is then impossible.
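
To illustrate the flag idea, here is a small user-space model in C.
Nothing in it is an existing kernel interface: HYP_F_THREAD_GROUP,
hyp_set_policy() and struct thread_group are hypothetical names made up
for this sketch. Without the flag only the target thread changes; with
it, every thread in the group does, so per-thread binding stays
possible.

```c
#include <stddef.h>

#define MAX_THREADS 8
/* Hypothetical flag: apply to every thread in the group, not just the target. */
#define HYP_F_THREAD_GROUP 0x1u

struct thread {
    int tid;
    int policy_mode;          /* opaque policy value for this model */
};

struct thread_group {
    struct thread threads[MAX_THREADS];
    size_t nr;                /* number of valid entries in threads[] */
};

/*
 * Set the policy of thread `tid`, or of the whole group when
 * HYP_F_THREAD_GROUP is passed. Returns the number of threads updated,
 * or -1 if `tid` does not belong to the group.
 */
static int hyp_set_policy(struct thread_group *tg, int tid, int mode,
                          unsigned int flags)
{
    size_t i;
    int found = 0;
    int updated = 0;

    for (i = 0; i < tg->nr; i++)
        if (tg->threads[i].tid == tid)
            found = 1;
    if (!found)
        return -1;

    for (i = 0; i < tg->nr; i++) {
        if ((flags & HYP_F_THREAD_GROUP) || tg->threads[i].tid == tid) {
            tg->threads[i].policy_mode = mode;
            updated++;
        }
    }
    return updated;
}
```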

Agreed. What about a syscall that only applies per-mm? There are
precedents like process_madvise(2).