Re: [PATCH v5 2/2] sched/numa: add statistics of numa balance task

From: Chen, Yu C
Date: Tue Jun 03 2025 - 10:46:35 EST


Hi Michal,

On 6/3/2025 12:53 AM, Michal Koutný wrote:
> On Tue, May 27, 2025 at 11:15:33AM -0700, Shakeel Butt <shakeel.butt@xxxxxxxxx> wrote:
>> I am now more inclined to keep these new stats in memory.stat as the
>> current version is doing because:
>>
>> 1. Relevant stats are exposed through the same interface, and we already
>> have NUMA balancing stats in memory.stat.
>>
>> 2. There is no single good home for these new stats, and exposing them in
>> cpu.stat would require more code. Even if we reuse the memcg infra, we
>> would still need to flush the memcg stats, so why not just expose them
>> in memory.stat.
>>
>> 3. Though a bit far-fetched, I think we may add more stats which sit at
>> the boundary of sched and mm in the future. NUMA balancing is one
>> concrete example of such stats. I am envisioning that for reliable memory
>> reclaim or overcommit, there might be some useful events as well.
>> Anyway, it is still unbaked atm.
>>
>> Michal, let me know your thoughts on this.

> I reckon users may be a little bit more likely to look for that info in
> memory.stat.
>
> Which would be OK unless threaded subtrees are considered (e.g. cpuset
> (NUMA affinity) has thread granularity) and these migration stats are
> potentially per-thread relevant.
>
> I was also pondering why a misplaced container cannot be found via the
> existing NUMA stats. Chen has explained task vs page migration in NUMA
> balancing. I guess a mere page migration count (especially when
> stagnating) may not point to the misplaced container. OK.

> The second thing is what a "misplaced" container is. Is it because of a
> wrong set_mempolicy(2) or cpuset configuration? If it's the latter (i.e.
> it requires the cpuset controller to be enabled), it'd justify exposing
> this info in cpuset.stat; if it's the former, the cgroup aggregation is
> not that relevant (hence /proc/<PID>/sched is sufficient). Or is there
> another meaning of a misplaced container? Chen, could you please clarify?

My understanding is that the "misplaced" container is not strictly tied
to set_mempolicy or cpuset configuration, but is mainly caused by the
scheduler's generic load balancer. The generic load balancer spreads
tasks across different nodes to fully utilize idle CPUs, while NUMA
balancing tries to pull misplaced tasks/pages back to honor NUMA locality.

Regarding threaded subtree mode, I was previously unfamiliar with it and
have been trying to understand it better. If I understand correctly, when
threads within a single process are placed in different cgroups via cpuset,
we might need to scan /proc/<PID>/sched to collect the NUMA task
migration/swap statistics. If threaded subtrees are not used for that
process, we can query memory.stat instead.
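
For illustration, here is a minimal userspace sketch of that split. It is
only a sketch under a few assumptions: the counter names (numa_task_migrated,
numa_task_swapped) are taken to be the ones proposed in this series, the
cgroup path is hypothetical, and the per-thread values are read from
/proc/<tid>/sched using the TIDs listed in the group's cgroup.threads file.

#!/usr/bin/env python3
# Sketch: collect NUMA-balancing task migration/swap counters for a cgroup.
# Counter names are assumed from this patch series; adjust if they differ.
from pathlib import Path

FIELDS = ("numa_task_migrated", "numa_task_swapped")

def from_memory_stat(cgroup: Path) -> dict:
    """Non-threaded group: read the aggregated counters from memory.stat."""
    stats = dict.fromkeys(FIELDS, 0)
    for line in (cgroup / "memory.stat").read_text().splitlines():
        key, _, val = line.partition(" ")
        if key in FIELDS:
            stats[key] = int(val)
    return stats

def from_proc_sched(cgroup: Path) -> dict:
    """Threaded subtree: sum per-thread counters from /proc/<tid>/sched,
    since the memory controller cannot be enabled in threaded groups."""
    totals = dict.fromkeys(FIELDS, 0)
    for tid in (cgroup / "cgroup.threads").read_text().split():
        try:
            sched = Path(f"/proc/{tid}/sched").read_text()
        except OSError:
            continue  # thread exited between listing and reading
        for line in sched.splitlines():
            key, _, val = line.partition(":")
            if key.strip() in FIELDS:
                totals[key.strip()] += int(val.strip())
    return totals

if __name__ == "__main__":
    cg = Path("/sys/fs/cgroup/test")  # hypothetical cgroup
    # memory.stat is only present where the memory controller is enabled,
    # so fall back to the per-thread scan otherwise.
    print(from_memory_stat(cg) if (cg / "memory.stat").exists()
          else from_proc_sched(cg))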

I agree with your prior point that NUMA balancing task activity is not directly
associated with either the memory controller or the CPU controller. Although
showing this data in cpu.stat might seem more appropriate, we expose it in
memory.stat due to the following trade-offs (or as an exception for
NUMA balancing):

1. It aligns with the existing NUMA-related metrics already present in memory.stat.
2. It simplifies the implementation.

thanks,
Chenyu


> Because the memory controller doesn't control NUMA, it needn't be enabled
> to have these statistics, and it cannot be enabled in threaded groups, so
> I'm having some doubts about whether memory.stat is a good home for this
> field.
>
> Regards,
> Michal