Re: [PATCH] sched/numa: Fix NULL pointer access to mm_struct during task swap

From: Chen, Yu C
Date: Fri Jul 04 2025 - 01:58:39 EST


On 7/3/2025 10:01 PM, Peter Zijlstra wrote:
On Thu, Jul 03, 2025 at 09:38:08PM +0800, Chen, Yu C wrote:
Hi Peter,

On 7/3/2025 8:36 PM, Peter Zijlstra wrote:
On Thu, Jul 03, 2025 at 05:20:47AM -0700, Libo Chen wrote:

I agree. The other parts, schedstat and vmstat, are still quite helpful.
Also, tracepoints are more expensive than counters once enabled; I think
that's too much overhead just for counting numbers.

I'm not generally a fan of eBPF, but supposedly it is really good for
stuff like this.

Attaching to a tracepoint and distributing into cgroup buckets seems
like it should be a trivial script.
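
For reference, a minimal sketch of such a script, written as a
libbpf-style BPF program in C (the file name, map name, and build setup
are illustrative assumptions, not anything from this thread): it attaches
to the existing sched:sched_swap_numa trace event and counts hits per
cgroup id.

/* numa_swap_count.bpf.c - hypothetical sketch, assuming a libbpf/CO-RE
 * build. Counts sched:sched_swap_numa events per cgroup id; a user-space
 * loader (not shown) would walk the map to report per-cgroup totals.
 */
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>

struct {
	__uint(type, BPF_MAP_TYPE_HASH);
	__uint(max_entries, 1024);
	__type(key, __u64);   /* cgroup id */
	__type(value, __u64); /* event count */
} swaps_per_cgroup SEC(".maps");

SEC("tracepoint/sched/sched_swap_numa")
int count_numa_swap(void *ctx)
{
	__u64 cgid = bpf_get_current_cgroup_id();
	__u64 one = 1, *cnt;

	cnt = bpf_map_lookup_elem(&swaps_per_cgroup, &cgid);
	if (cnt)
		__sync_fetch_and_add(cnt, 1);
	else
		bpf_map_update_elem(&swaps_per_cgroup, &cgid, &one, BPF_NOEXIST);
	return 0;
}

char LICENSE[] SEC("license") = "GPL";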

Yes, it is feasible to use eBPF. On the other hand, if some
existing monitoring programs rely on /proc/{pid}/sched to observe
the NUMA balancing metrics of processes, it might be helpful to
include the NUMA migration/swap information in /proc/{pid}/sched.
This approach would minimize the modifications needed for those
monitoring programs and eliminate the need to add a new BPF script
to obtain NUMA balancing statistics from a different source, IMHO.
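
(For illustration, the consumer side this would keep working looks
roughly like the sketch below: scan /proc/<pid>/sched for the NUMA
balancing fields such as numa_pages_migrated and total_numa_faults,
which appear when CONFIG_NUMA_BALANCING is enabled. The program itself
is a hypothetical example, not an existing tool.)

/* Hypothetical sketch of the existing monitoring pattern: read the NUMA
 * balancing fields out of /proc/<pid>/sched. Assumes a kernel with
 * CONFIG_NUMA_BALANCING; exact field names can vary by kernel version.
 */
#include <stdio.h>
#include <string.h>

int main(int argc, char **argv)
{
	char path[64], line[256];
	FILE *f;

	snprintf(path, sizeof(path), "/proc/%s/sched",
		 argc > 1 ? argv[1] : "self");
	f = fopen(path, "r");
	if (!f) {
		perror("fopen");
		return 1;
	}
	while (fgets(line, sizeof(line), f)) {
		/* e.g. numa_pages_migrated, total_numa_faults, ... */
		if (strstr(line, "numa"))
			fputs(line, stdout);
	}
	fclose(f);
	return 0;
}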

Maybe...

The thing is, most of the time the effort spent on collecting all these
numbers is wasted energy, since nobody ever looks at them.


As for per-task NUMA balancing activity itself, we found it useful for
debugging when trying to ensure that cache-aware load balancing coexists
properly with NUMA balancing.
Sometimes we're stuck with ABI, like the proc files you mentioned. We
can't readily remove them, stuff would break. But does that mean we
should endlessly add to them just because it's convenient?

Ideally I would strip out all the statistics and accounting crap and
make sure we have tracepoints (not trace-events) covering all the needed
spots, and then maybe just maybe have a few kernel modules that hook
into those tracepoints to provide the legacy interfaces.

That way, only the people that care get to pay the overhead of actually
collecting the numbers.

One can dream I suppose... :-)
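
(A rough sketch of that last idea, purely illustrative: a module that
registers a probe on an existing tracepoint and keeps its own counter,
which it could then publish through a legacy-style file. It assumes the
tracepoint is visible to modules, i.e. exported with
EXPORT_TRACEPOINT_SYMBOL_GPL, which is not the case for every sched
tracepoint; sched_swap_numa is used only as an example.)

/* legacy_stats.c - hypothetical sketch of a module providing a legacy
 * statistics interface on top of a tracepoint. Assumes the tracepoint
 * is exported to modules.
 */
#include <linux/module.h>
#include <linux/atomic.h>
#include <linux/tracepoint.h>
#include <trace/events/sched.h>

static atomic64_t numa_swaps = ATOMIC64_INIT(0);

/* Probe signature mirrors the sched_swap_numa TP_PROTO. */
static void probe_swap_numa(void *data, struct task_struct *src_tsk,
			    int src_cpu, struct task_struct *dst_tsk,
			    int dst_cpu)
{
	atomic64_inc(&numa_swaps);
	/* A real module would expose this counter via its own file. */
}

static int __init legacy_stats_init(void)
{
	return register_trace_sched_swap_numa(probe_swap_numa, NULL);
}

static void __exit legacy_stats_exit(void)
{
	unregister_trace_sched_swap_numa(probe_swap_numa, NULL);
	tracepoint_synchronize_unregister();
}

module_init(legacy_stats_init);
module_exit(legacy_stats_exit);
MODULE_LICENSE("GPL");
MODULE_DESCRIPTION("Sketch: tracepoint-backed legacy statistics");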

I see.

If I understand correctly, it's generally not recommended to add
new items under /proc. Users are encouraged to use tracepoints/trace
events when collecting statistics, and something like schedstat_inc()
should be avoided. Is per-task data an exception? We recently exposed
a task's slice via /proc/pid/sched :D
thanks,
Chenyu