Re: [PATCH] mm, oom: allow oom reaper to race with exit_mmap

From: Michal Hocko
Date: Wed Jul 26 2017 - 01:45:42 EST


On Tue 25-07-17 20:26:19, Andrea Arcangeli wrote:
> On Tue, Jul 25, 2017 at 05:45:14PM +0200, Michal Hocko wrote:
> > That problem is real though as reported by David.
>
> I'm not against fixing it, I just think it's not a major concern, and
> the solution doesn't seem optimal as measured by Kirill.
>
> I'm just skeptical it's the best to solve that tiny race, 99.9% of the
> time such down_write is unnecessary.
>
> > it is not only about exit_mmap. __mmput calls into exit_aio and that can
> > wait for completion and there is no way to guarantee this will finish in
> > finite time.
>
> exit_aio blocking is actually the only good point for wanting this
> concurrency where exit_mmap->unmap_vmas and
> oom_reap_task->unmap_page_range have to run concurrently on the same
> mm.

Yes, exit_aio is the only blocking call I know of currently. But I would
like this to be as robust as possible and so I do not want to rely on
the current implementation. This can change in future and I can
guarantee that nobody will think about the oom path when adding
something to the final __mmput path.

> exit_mmap would have no issue, if there was enough time in the
> lifetime CPU to allocate the memory, sure the memory will also be
> freed in finite amount of time by exit_mmap.

I am not sure I understand. Say that any call prior to unmap_vmas blocks
on a lock which is held by another call path which cannot proceed with
the allocation...

> In fact you mentioned multiple OOM in the NUMA case, exit_mmap may not
> solve that, so depending on the runtime it may have been better not to
> wait all memory of the process to be freed before moving to the next
> task, but only a couple of seconds before the OOM reaper moves to a
> new candidate. Again this is only a tradeoff between solving the OOM
> faster vs risk of false positives OOM.

I really do not want to rely on any timing. This just too fragile. Once
we have killed a task then we shouldn't pick another victim until it
passed exit_mmap or the oom_reaper did its job. Otherwise we just risk
false positives while we have already disrupted the workload.

> If it wasn't because of exit_aio (which may have to wait I/O
> completion), changing the OOM reaper to return "false" if
> mmget_not_zero returns zero and MMF_OOM_SKIP is not set yet, would
> have been enough (and depending on the runtime it may have solved OOM
> faster in NUMA) and there would be absolutely no need to run OOM
> reaper and exit_mmap concurrently on the same mm. However there's such
> exit_aio..
>
> Raw I/O mempools never require memory allocations, although aio if it
> involves a filesystem to complete may run into filesystem or buffering
> locks which are known to loop forever or depend on other tasks stuck
> in kernel allocations, so I didn't go down that chain too long.

Exactly. We simply cannot assume anything here because veryfying this
basically impossible.

[...]
> diff --git a/mm/mmap.c b/mm/mmap.c
> index f19efcf75418..615133762b99 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -2993,6 +2993,11 @@ void exit_mmap(struct mm_struct *mm)
> /* Use -1 here to ensure all VMAs in the mm are unmapped */
> unmap_vmas(&tlb, vma, 0, -1);
>
> + if (test_and_set_bit(MMF_OOM_SKIP, &mm->flags)) {
> + /* wait the OOM reaper to stop working on this mm */
> + down_write(&mm->mmap_sem);
> + up_write(&mm->mmap_sem);
> + }
> free_pgtables(&tlb, vma, FIRST_USER_ADDRESS, USER_PGTABLES_CEILING);
> tlb_finish_mmu(&tlb, 0, -1);
>
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index 9e8b4f030c1c..2a7000995784 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -471,6 +471,7 @@ static bool __oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
> struct mmu_gather tlb;
> struct vm_area_struct *vma;
> bool ret = true;
> + bool mmgot = true;
>
> /*
> * We have to make sure to not race with the victim exit path
> @@ -500,9 +501,16 @@ static bool __oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
> * and delayed __mmput doesn't matter that much
> */
> if (!mmget_not_zero(mm)) {
> - up_read(&mm->mmap_sem);
> trace_skip_task_reaping(tsk->pid);
> - goto unlock_oom;
> + /*
> + * MMF_OOM_SKIP is set by exit_mmap when the OOM
> + * reaper can't work on the mm anymore.
> + */
> + if (test_and_set_bit(MMF_OOM_SKIP, &mm->flags)) {
> + up_read(&mm->mmap_sem);
> + goto unlock_oom;
> + }
> + mmgot = false;
> }

This will work more or less the same to what we have currently.

[victim] [oom reaper] [oom killer]
do_exit __oom_reap_task_mm
mmput
__mmput
mmget_not_zero
test_and_set_bit(MMF_OOM_SKIP)
oom_evaluate_task
# select next victim
# reap the mm
unmap_vmas

so we can select a next victim while the current one is still not
completely torn down.

> > > 4) how is it safe to overwrite a VM_FAULT_RETRY that returns without
> > > mmap_sem and then the arch code will release the mmap_sem despite
> > > it was already released by handle_mm_fault? Anonymous memory faults
> > > aren't common to return VM_FAULT_RETRY but an userfault
> > > can. Shouldn't there be a block that prevents overwriting if
> > > VM_FAULT_RETRY is set below? (not only VM_FAULT_ERROR)
> > >
> > > if (unlikely((current->flags & PF_KTHREAD) && !(ret & VM_FAULT_ERROR)
> > > && test_bit(MMF_UNSTABLE, &vma->vm_mm->flags)))
> > > ret = VM_FAULT_SIGBUS;
> >
> > I am not sure I understand what you mean and how this is related to the
> > patch?
>
> It's not related to the patch but it involves the OOM reaper as it
> only happens when MMF_UNSTABLE is set which is set only by the OOM
> reaper. I was simply reading the OOM reaper code and following up what
> MMF_UNSTABLE does and it ringed a bell.

I hope 3f70dc38cec2 ("mm: make sure that kthreads will not refault oom
reaped memory") will clarify this code. If not please start a new thread
so that we do not conflate different things together.

--
Michal Hocko
SUSE Labs