Re: [PATCH] sched/numa: Fix NULL pointer access to mm_struct during task swap

From: Michal Hocko
Date: Thu Jul 03 2025 - 03:18:28 EST


On Thu 03-07-25 00:32:47, Chen Yu wrote:
> It was reported that after commit ad6b26b6a0a7
> ("sched/numa: add statistics of numa balance task"),
> a NULL pointer dereference [1] occurs when accessing
> p->mm. The following race condition was found to
> trigger this bug: after a swap task candidate is
> chosen during NUMA balancing, its mm_struct is
> released because the task exits. Later, when the task
> swap is performed, p->mm is NULL, which causes
> the problem:
>
> CPU0                                   CPU1
> :
> ...
> task_numa_migrate
>   task_numa_find_cpu
>     task_numa_compare
>       # a normal task p is chosen
>       env->best_task = p
>
>                                        # p exit:
>                                        exit_signals(p);
>                                          p->flags |= PF_EXITING
>                                        exit_mm
>                                          p->mm = NULL;
>
> migrate_swap_stop
>   __migrate_swap_task(arg->src_task, arg->dst_cpu)
>     count_memcg_event_mm(p->mm, NUMA_TASK_SWAP) # p->mm is NULL
>
> Fix this issue by checking if the task has the PF_EXITING
> flag set in migrate_swap_stop(). If it does, skip updating
> the memcg events. Additionally, log a warning if p->mm is
> NULL to facilitate future debugging.
>
> Fixes: ad6b26b6a0a7 ("sched/numa: add statistics of numa balance task")
> Reported-by: Jirka Hladky <jhladky@xxxxxxxxxx>
> Closes: https://lore.kernel.org/all/CAE4VaGBLJxpd=NeRJXpSCuw=REhC5LWJpC29kDy-Zh2ZDyzQZA@xxxxxxxxxxxxxx/
> Reported-by: Srikanth Aithal <Srikanth.Aithal@xxxxxxx>
> Reported-by: Suneeth D <Suneeth.D@xxxxxxx>
> Suggested-by: Libo Chen <libo.chen@xxxxxxxxxx>
> Signed-off-by: Chen Yu <yu.c.chen@xxxxxxxxx>
> ---
> kernel/sched/core.c | 9 ++++++++-
> 1 file changed, 8 insertions(+), 1 deletion(-)
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 8988d38d46a3..4e06bb955dad 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -3364,7 +3364,14 @@ static void __migrate_swap_task(struct task_struct *p, int cpu)
>  {
>          __schedstat_inc(p->stats.numa_task_swapped);
>          count_vm_numa_event(NUMA_TASK_SWAP);
> -        count_memcg_event_mm(p->mm, NUMA_TASK_SWAP);
> +        /* exiting task has NULL mm */
> +        if (!(p->flags & PF_EXITING)) {
> +                WARN_ONCE(!p->mm, "swap task %d %s %x has no mm\n",
> +                          p->pid, p->comm, p->flags);

As Andrew already said, this is not really acceptable: it is very likely
too easy to trigger, and a) you do not want logs flooded with warnings,
and b) there are setups with panic_on_warn configured, for which this
would be fatal without any good reason.
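
To illustrate point b), here is a minimal user-space model of a once-only
warning combined with a panic_on_warn switch. Everything in it
(model_warn_once(), model_panic_on_warn, enum warn_result) is a
hypothetical stand-in, not the kernel's WARN_ONCE() implementation; it
only shows that, with panic_on_warn set, the first time the condition is
hit becomes a fatal event rather than a single log line.

```c
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical outcome of evaluating a warning; not a kernel type. */
enum warn_result { WARN_SKIPPED, WARN_LOGGED, WARN_FATAL };

/* Hypothetical stand-in for the panic_on_warn setting. */
static bool model_panic_on_warn;

/*
 * Hypothetical model of a once-only warning: it reports the first time
 * @cond is true and stays silent afterwards. *fired plays the role of the
 * per-call-site "already warned" flag. With model_panic_on_warn set, the
 * first hit is reported as fatal instead of being logged.
 */
static enum warn_result model_warn_once(bool cond, bool *fired, const char *msg)
{
	if (!cond || *fired)
		return WARN_SKIPPED;
	*fired = true;
	if (model_panic_on_warn)
		return WARN_FATAL;	/* the real kernel would panic here */
	fprintf(stderr, "WARNING: %s\n", msg);
	return WARN_LOGGED;
}
```

With the switch off, two calls with a true condition log once and then
stay silent; with the switch on, the very first hit reports WARN_FATAL.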

> +
> +                if (p->mm)
> +                        count_memcg_event_mm(p->mm, NUMA_TASK_SWAP);
> +        }

Why are you testing for p->mm here? Isn't the PF_EXITING test sufficient?
A robust way to guarantee a non-NULL mm against races with an exiting
task is find_lock_task_mm(), but that is probably too heavyweight for
this path.
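
A minimal user-space sketch of the two options, assuming a simplified
single-task model: model_task, model_mm, MODEL_PF_EXITING and both
helpers are hypothetical. The locked variant follows the idea behind
find_lock_task_mm() (check ->mm while holding the task lock, which the
exit path also takes when clearing ->mm) but, unlike the real helper,
does not walk the other threads of the process. The flag-only variant is
what the PF_EXITING question is about: it is only safe if the flag is set
before ->mm is cleared and the task cannot progress through exit while
the check runs.

```c
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_PF_EXITING 0x4u	/* hypothetical flag value for the model */

struct model_mm {
	unsigned long numa_task_swap_events;
};

struct model_task {
	pthread_mutex_t alloc_lock;	/* stands in for task_lock() */
	unsigned int flags;
	struct model_mm *mm;
};

/* Exit side, simplified: set the exiting flag, then clear ->mm under the lock. */
static void model_exit(struct model_task *p)
{
	pthread_mutex_lock(&p->alloc_lock);
	p->flags |= MODEL_PF_EXITING;
	p->mm = NULL;
	pthread_mutex_unlock(&p->alloc_lock);
}

/* Locked variant: ->mm cannot be cleared between the check and the use. */
static bool model_count_swap_locked(struct model_task *p)
{
	bool counted = false;

	pthread_mutex_lock(&p->alloc_lock);
	if (p->mm) {
		p->mm->numa_task_swap_events++;
		counted = true;
	}
	pthread_mutex_unlock(&p->alloc_lock);
	return counted;
}

/* Flag-only variant: relies on the ordering assumptions described above. */
static bool model_count_swap_flag(struct model_task *p)
{
	if (p->flags & MODEL_PF_EXITING)
		return false;
	p->mm->numa_task_swap_events++;
	return true;
}
```

In this single-threaded model both helpers count while the task is alive
and both decline once model_exit() has run; the difference only matters
under concurrency, where holding the lock closes the window between the
check and the use.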
>
>          if (task_on_rq_queued(p)) {
>                  struct rq *src_rq, *dst_rq;
> --
> 2.25.1
>

--
Michal Hocko
SUSE Labs