I think it may actually be difficult to do on some level or there was some
reason we couldn't, but I may be mistaken.
Down the rabbit hole we go..
The cloning of the tree happens by copying the tree in DFS order and
replacing the old nodes with new nodes.  The tree leaves end up being
copied too, and those hold the pointers to all the vmas (unless
DONT_COPY is set, so basically always all of them..).  Once the tree is
copied, we have a duplicate of the tree whose slots still point to all
the vmas in the old process.
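Roughly, the flow looks like this - a simplified sketch of the idea,
not the literal dup_mmap() code (dup_tree_sketch() is a made-up name;
__mt_dup(), vm_area_dup() and mas_store() are the real helpers, but
locking and the rest of the vma setup are hand-waved):

/*
 * Sketch: clone the whole tree first, then walk the clone and replace
 * each slot (an old-vma pointer) with a freshly allocated copy.
 */
static int dup_tree_sketch(struct mm_struct *mm, struct mm_struct *oldmm)
{
        MA_STATE(mas, &mm->mm_mt, 0, 0);
        struct vm_area_struct *old, *new;
        int ret;

        ret = __mt_dup(&oldmm->mm_mt, &mm->mm_mt, GFP_KERNEL);
        if (ret)
                return ret;

        mas_for_each(&mas, old, ULONG_MAX) {
                new = vm_area_dup(old);         /* may fail with -ENOMEM */
                if (!new)
                        return -ENOMEM;         /* failure point; slots from
                                                 * here on still point into
                                                 * oldmm (marker store below) */
                mas_store(&mas, new);           /* slot-exact replacement */
        }
        return 0;
}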
The way this fails is that we're unable to finish the clone-and-replace,
usually for out of memory reasons.  So we end up with a tree holding a
mix of new and exciting vmas that have never been used and pointers to
old, but still active, vmas in oldmm.
The failure point is then marked with an XA_ZERO_ENTRY, which is
guaranteed to store successfully because it's a direct slot replacement
in the tree, so no allocations are necessary.  That makes it safe even
in -ENOMEM scenarios.
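Continuing the sketch above, the marker store at the failure point would
look something like this (again an approximation from memory, not the
exact fork.c code):

        /*
         * Overwriting the existing slot with XA_ZERO_ENTRY is a 1:1
         * replacement, so no new nodes are needed and the store cannot
         * fail, even though we got here because of -ENOMEM.
         */
        mas_set_range(&mas, old->vm_start, old->vm_end - 1);
        mas_store(&mas, XA_ZERO_ENTRY);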
Clearing out the stale data is where it gets awkward: removing vmas from
the new tree may itself require allocations, because the maple tree is
built from allocated nodes - we'll need to rebalance, allocate new
parents, etc, etc.  So, just to remove the stale data, we may actually
have to allocate memory.
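To make the contrast with the marker store explicit: erasing an entry is
a range store of NULL, and that store can need new nodes.  Something
like this, continuing with the same mas (a sketch, not actual removal
code):

        /*
         * Unlike the 1:1 XA_ZERO_ENTRY replacement, clearing a range can
         * merge/rebalance nodes, so it may need to allocate - and can
         * fail with -ENOMEM, which is exactly the state we're in.
         */
        mas_set_range(&mas, old->vm_start, old->vm_end - 1);
        ret = mas_store_gfp(&mas, NULL, GFP_KERNEL);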
But we're most likely out of memory.. and we don't want to get the
shrinker involved in a broken task teardown, especially since it has
already been run and failed to help..
We could replace all the old vmas with XA_ZERO_ENTRY, but that doesn't
really fix this issue either.
I could make a function that frees all new vmas and destroys the tree
specifically for this failure state?
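Hypothetically, something like this (free_partial_dup() is a made-up
name; locking and the per-vma teardown details - anon_vma, file refs,
etc. - are hand-waved):

/*
 * Hypothetical cleanup for the dup_mmap() failure state: free the vma
 * copies that already made it into the new tree, then tear the tree
 * down.  Everything at and past the XA_ZERO_ENTRY marker still belongs
 * to oldmm and must not be touched.
 */
static void free_partial_dup(struct mm_struct *mm)
{
        MA_STATE(mas, &mm->mm_mt, 0, 0);
        struct vm_area_struct *vma;

        mas_for_each(&mas, vma, ULONG_MAX) {
                if (xa_is_zero(vma))
                        break;                  /* failure marker: the rest
                                                 * is oldmm's */
                vm_area_free(vma);              /* only the copies we made */
        }
        /* __mt_destroy() only frees nodes, so teardown needs no allocation */
        __mt_destroy(&mm->mm_mt);
}

That would leave the mm with an empty tree and zero vmas.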
I'm almost certain this will lead to another whack-a-mole situation, but
the affected code paths _should_ already be checked, or written to cope,
when there are zero vmas in an mm (going by experience of what the
scheduler does with an empty tree).  Syzbot sometimes finds these
scenarios via signals or other corner cases..
Then again, I also thought the unstable mm should be checked where
necessary to avoid assumptions on the mm state..?
This is funny because we already have a (probably) benign race with oom
here.  This code may already visit the mm after __oom_reap_task_mm() has
run and the mm has disappeared, but since the anon vmas should be
removed, unuse_mm() will skip them.
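For reference, the swapoff side looks roughly like this (paraphrased
from mm/swapfile.c from memory, so treat the details as approximate):

/*
 * Paraphrase of unuse_mm(): it only walks whatever vmas are still in
 * the tree, and only bothers with ones that have an anon_vma.  An empty
 * tree, or vmas with nothing anonymous left, means it does nothing.
 */
static int unuse_mm_sketch(struct mm_struct *mm, unsigned int type)
{
        struct vm_area_struct *vma;
        int ret = 0;
        VMA_ITERATOR(vmi, mm, 0);

        mmap_read_lock(mm);
        for_each_vma(vmi, vma) {
                if (vma->anon_vma) {
                        ret = unuse_vma(vma, type);     /* real per-vma helper */
                        if (ret)
                                break;
                }
                cond_resched();
        }
        mmap_read_unlock(mm);
        return ret;
}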
Although, I'm not sure what happens when
mmu_notifier_invalidate_range_start_nonblock() fails AND unuse_mm() is
then called on the mm.  Maybe checking for the unstable mm is necessary
here anyway?