Re: [PATCH] mm: swap: check for xa_zero_entry() on vma in swapoff path
From: David Hildenbrand
Date: Fri Aug 08 2025 - 08:01:55 EST
On 08.08.25 11:21, Charan Teja Kalla wrote:
It is possible to hit a zero entry while traversing the vmas in
unuse_mm(), called from the swapoff path. Not checking for the zero
entry can result in operating on it as a vma, which leads to an oops.
The issue manifests in the following race between fork() on a
process and swapoff:
fork(dup_mmap())                                swapoff(unuse_mm)
---------------                                 -----------------
1) Identical mtree is built using
   __mt_dup().

2) copy_pte_range()-->
       copy_nonpresent_pte():
   The dst mm is added into the
   mmlist to be visible to the
   swapoff operation.

3) Fatal signal is sent to the parent
   process(which is the current during the
   fork) thus skip the duplication of the
   vmas and mark the vma range with
   XA_ZERO_ENTRY as a marker for this process
   that helps during exit_mmap().

                                                4) swapoff is tried on the
                                                   'mm' added to the 'mmlist' as
                                                   part of the 2.

                                                5) unuse_mm(), that iterates
                                                   through the vma's of this 'mm'
                                                   will hit the non-NULL zero entry
                                                   and operating on this zero entry
                                                   as a vma is resulting into the
                                                   oops.
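To make step 5 concrete, here is a minimal userspace sketch of the kind of check the patch title describes. It is only a model: struct toy_vma, TOY_ZERO_ENTRY and toy_count_usable_vmas() are hypothetical names, and a flat array stands in for the maple tree. In the kernel the marker test would be xa_is_zero(); stopping at the marker (rather than skipping it) is an assumption, taken from how the dup_mmap() comment describes exit_mmap().

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical userspace stand-ins; none of these are kernel definitions. */
struct toy_vma {
	unsigned long start;
	unsigned long end;
};

/* Plays the role of XA_ZERO_ENTRY: non-NULL, but not a usable vma. */
static struct toy_vma toy_zero_marker;
#define TOY_ZERO_ENTRY (&toy_zero_marker)

/*
 * Walk the slots the way a vma iterator would and count the entries that
 * are safe to treat as vmas. Empty slots are skipped. The walk stops at
 * the marker, because nothing after the failure point was duplicated.
 */
static size_t toy_count_usable_vmas(struct toy_vma *const *slots, size_t n)
{
	size_t usable = 0;

	assert(slots != NULL || n == 0);
	for (size_t i = 0; i < n; i++) {
		if (slots[i] == NULL)
			continue;
		if (slots[i] == TOY_ZERO_ENTRY)
			break;
		/* Only past this point is it safe to dereference the entry. */
		assert(slots[i]->start < slots[i]->end);
		usable++;
	}
	return usable;
}
```

Without the TOY_ZERO_ENTRY test, the walker would dereference the marker as if it were a vma, which is the analogue of the oops described above.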
That looks like something Liam or Lorenzo could help with reviewing.
I suspect a proper fix would be around not exposing this
partially-valid tree to others when dropping the mmap lock ...
While we dup the mm, the new process MM is write-locked -- see
dup_mmap() -- and unuse_mm() will read-lock the mmap_lock. So
in that period everything is fine.
I guess the culprit is in dup_mmap() when we do on error:
	} else {
		/*
		 * The entire maple tree has already been duplicated. If the
		 * mmap duplication fails, mark the failure point with
		 * XA_ZERO_ENTRY. In exit_mmap(), if this marker is encountered,
		 * stop releasing VMAs that have not been duplicated after this
		 * point.
		 */
		if (mpnt) {
			mas_set_range(&vmi.mas, mpnt->vm_start, mpnt->vm_end - 1);
			mas_store(&vmi.mas, XA_ZERO_ENTRY);
			/* Avoid OOM iterating a broken tree */
			set_bit(MMF_OOM_SKIP, &mm->flags);
		}
		/*
		 * The mm_struct is going to exit, but the locks will be dropped
		 * first. Set the mm_struct as unstable is advisable as it is
		 * not fully initialised.
		 */
		set_bit(MMF_UNSTABLE, &mm->flags);
	}
Shouldn't we just remove anything from the tree here that was not copied
immediately?
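
To illustrate the idea, a rough userspace model (not kernel code) of what "remove anything that was not copied" could look like. All names here (struct model_vma, model_drop_uncopied(), model_count_entries()) are hypothetical, and a flat array again stands in for the maple tree. The real error path would have to do the equivalent with maple tree operations while still holding the mmap write lock.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a vma; not a kernel definition. */
struct model_vma {
	unsigned long start;
	unsigned long end;
};

/*
 * Model of the error path: slots[0..failed_at) hold vmas that were really
 * duplicated, slots[failed_at..n) still hold entries copied wholesale from
 * the parent. Instead of storing a marker at the failure point, erase
 * everything from that point on.
 */
static void model_drop_uncopied(struct model_vma **slots, size_t n,
				size_t failed_at)
{
	assert(failed_at <= n);
	for (size_t i = failed_at; i < n; i++)
		slots[i] = NULL;
}

/* A walker that knows nothing about markers: it only skips empty slots. */
static size_t model_count_entries(struct model_vma *const *slots, size_t n)
{
	size_t count = 0;

	for (size_t i = 0; i < n; i++) {
		if (slots[i] != NULL)
			count++;
	}
	return count;
}
```

With the tail erased before the lock is dropped, every other walker sees only entries that are real vmas or empty slots, so none of them needs a zero-entry special case.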
--
Cheers,
David / dhildenb