Re: [PATCH v2] sched/numa: Introduce per cgroup numa balance control
From: Chen, Yu C
Date: Thu Jun 26 2025 - 05:08:48 EST
Hi Michal,
Thanks for taking a look.
On 6/25/2025 8:19 PM, Michal Koutný wrote:
> On Wed, Jun 25, 2025 at 06:23:37PM +0800, Chen Yu <yu.c.chen@xxxxxxxxx> wrote:
>> [Problem Statement]
>> Currently, NUMA balancing is configured system-wide.
>> However, in some production environments, different
>> cgroups may have varying requirements for NUMA balancing.
>> Some cgroups are CPU-intensive, while others are
>> memory-intensive. Some do not benefit from NUMA balancing
>> due to the overhead associated with VMA scanning, while
>> others prefer NUMA balancing as it helps improve memory
>> locality. In this case, system-wide NUMA balancing is
>> usually disabled to avoid causing regressions.
>>
>> [Proposal]
>> Introduce a per-cgroup interface to enable NUMA balancing
>> for specific cgroups.
>
> The balancing works at task granularity already, and this new attribute
> is not much of a resource to control.
> Have you considered a per-task attribute? (sched_setattr(), prctl() or
> similar) That one could be inherited, and the respective cgroups would be
> seeded with a process carrying the intended values.

OK, the prctl approach should work. However, setting this
attribute via cgroup might be more convenient for userspace,
IMHO. The original requirement stems from cloud environments,
where it's typically unacceptable to require applications to
modify their code to add prctl(). Thus, the orchestration layer
must handle this. For example, the initial process of the container
needs adjustment. After consulting with cloud-native developers,
I learned that containerd-shim-runc-v2 serves as the first
process. Therefore, we may need to modify the
containerd-shim-runc-v2 code to use prctl for the NUMA
balancing attribute, allowing child processes to inherit the
settings. Whereas with per-cgroup control, the user only needs to
touch a single cgroup interface file.
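
To give a rough idea of what the shim-side change would look like
(just a sketch; the PR_SET_NUMA_BALANCING command below is a made-up
placeholder for illustration, not an existing prctl ABI):

  #include <sys/prctl.h>
  #include <sys/types.h>
  #include <unistd.h>

  /* Placeholder value for illustration only; not a real ABI constant. */
  #ifndef PR_SET_NUMA_BALANCING
  #define PR_SET_NUMA_BALANCING 1000
  #endif

  /*
   * containerd-shim-runc-v2 style helper: spawn the container init with
   * NUMA balancing enabled, so that all descendants inherit the setting
   * across fork()/execve().
   */
  static pid_t spawn_container_init(char *const argv[])
  {
          pid_t pid = fork();

          if (pid == 0) {
                  if (prctl(PR_SET_NUMA_BALANCING, 1, 0, 0, 0))
                          _exit(127);
                  execvp(argv[0], argv);
                  _exit(127);
          }
          return pid;
  }

Every container runtime stack would need a change like this, while the
cgroup file only needs a single write from the orchestrator or the admin.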

> And cpuset could be
> traditionally used to restrict the scope of balancing of such tasks.
> WDYT?

In some scenarios, cgroups serve as micro-service containers.
They are not bound to any CPU sets and instead run freely on all
online CPUs. These cgroups can be sensitive to CPU capacity, as well
as NUMA locality (involving page migration and task migration).

>> This interface is associated with the CPU subsystem, which
>> does not support threaded subtrees, and is close to CPU bandwidth
>> control.
>
> (??) does support

Ah yes, it does support the threaded cgroup type. In that case, we
might need to disable the per-cgroup NUMA balancing control for
threaded cgroups.
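
Something along these lines in the write handler could do that (a sketch
only; the cpu.numa_balance file name, the handler name, and where the flag
is stored are placeholders, not what the patch currently uses; the
open-coded dom_cgrp check mirrors what cgroup_is_threaded() does
internally):

  /* Sketch: .write_u64 handler of a hypothetical cpu.numa_balance file. */
  static int cpu_numa_balance_write_u64(struct cgroup_subsys_state *css,
                                        struct cftype *cft, u64 enable)
  {
          struct cgroup *cgrp = css->cgroup;

          /* A threaded (non-domain) cgroup points to a different dom_cgrp. */
          if (cgrp->dom_cgrp != cgrp)
                  return -EOPNOTSUPP;

          /* Placeholder: record 'enable' in the corresponding task_group. */
          return 0;
  }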

>> The system administrator needs to set the NUMA balancing mode to
>> NUMA_BALANCING_CGROUP=4 to enable this feature. When the system is in
>> NUMA_BALANCING_CGROUP mode, NUMA balancing for all cgroups is disabled
>> by default. After the administrator enables this feature for a
>> specific cgroup, NUMA balancing for that cgroup is enabled.
>
> How dynamic do you expect such changes to be? In relation to a given
> cgroup's/process's lifecycle.

I think it depends on the design. Starting from Kubernetes v1.33,
there is a feature called "in-place Pod resize," which allows users
to modify CPU and memory requests and limits for containers (via
cgroup interfaces) in a running Pod, often without needing to
restart the container. That said, if an admin wants to adjust
NUMA balancing settings at runtime (after the monitor detects
excessive remote NUMA memory access), using prctl might require
iterating through each process in the cgroup and invoking prctl
on them individually.
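
In contrast, with the per-cgroup knob the monitor/agent only needs a
single write per cgroup at runtime, something like the below (again a
sketch; cpu.numa_balance is a placeholder file name), with no iteration
over cgroup.procs:

  #include <fcntl.h>
  #include <limits.h>
  #include <stdio.h>
  #include <unistd.h>

  /* Sketch: enable NUMA balancing for one cgroup at runtime. */
  static int enable_cgroup_numab(const char *cgrp_path)
  {
          char path[PATH_MAX];
          int fd, ret;

          snprintf(path, sizeof(path), "%s/cpu.numa_balance", cgrp_path);
          fd = open(path, O_WRONLY);
          if (fd < 0)
                  return -1;
          ret = (write(fd, "1", 1) == 1) ? 0 : -1;
          close(fd);
          return ret;
  }
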
thanks,
Chenyu
> Thanks,
> Michal