[PATCH 1/1] mm: oom: let oom_reap_task and exit_mmap run concurrently

From: Andrea Arcangeli
Date: Tue Jul 25 2017 - 14:02:27 EST


This is required purely because exit_aio() may block, so exit_mmap() may
never start if oom_reap_task cannot start running on an mm with
mm_users == 0.

At the same time, if the OOM reaper didn't wait at all for the memory
of the current OOM candidate to be freed by exit_mmap->unmap_vmas, it
would generate a spurious OOM kill.

If it weren't for exit_aio, it would be enough to change oom_reap_task
so that, when it finds mm_users == 0, it waits for a timeout or for
__mmput to set MMF_OOM_SKIP itself; but the problem here is not just
exit_mmap, so the concurrency of exit_mmap and oom_reap_task is
apparently warranted.

It's a non-standard combination at runtime: exit_mmap runs without
mmap_sem, while oom_reap_task runs with mmap_sem held for reading as
usual (in the style of MADV_DONTNEED).

The race between the two is solved with a pair of test_and_set_bit
calls and a dummy down_write/up_write cycle, along the same lines as
the ksm_exit method.

If the OOM reaper sets the bit first, exit_mmap will wait for it to
finish in down_write (before tearing down the mm structures that would
otherwise make oom_reap_task fail with a use-after-free).

If exit_mmap comes first, the OOM reaper will skip the mm, as it has
already been fully freed.

Signed-off-by: Andrea Arcangeli <aarcange@xxxxxxxxxx>
---
kernel/fork.c | 1 -
mm/mmap.c | 5 +++++
mm/oom_kill.c | 15 ++++++++++++---
3 files changed, 17 insertions(+), 4 deletions(-)

diff --git a/kernel/fork.c b/kernel/fork.c
index 9ec98b0c4675..ed412d85a596 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -910,7 +910,6 @@ static inline void __mmput(struct mm_struct *mm)
}
if (mm->binfmt)
module_put(mm->binfmt->module);
- set_bit(MMF_OOM_SKIP, &mm->flags);
mmdrop(mm);
}

diff --git a/mm/mmap.c b/mm/mmap.c
index f19efcf75418..615133762b99 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -2993,6 +2993,11 @@ void exit_mmap(struct mm_struct *mm)
/* Use -1 here to ensure all VMAs in the mm are unmapped */
unmap_vmas(&tlb, vma, 0, -1);

+ if (test_and_set_bit(MMF_OOM_SKIP, &mm->flags)) {
+ /* wait for the OOM reaper to stop working on this mm */
+ down_write(&mm->mmap_sem);
+ up_write(&mm->mmap_sem);
+ }
free_pgtables(&tlb, vma, FIRST_USER_ADDRESS, USER_PGTABLES_CEILING);
tlb_finish_mmu(&tlb, 0, -1);

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 9e8b4f030c1c..2a7000995784 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -471,6 +471,7 @@ static bool __oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
struct mmu_gather tlb;
struct vm_area_struct *vma;
bool ret = true;
+ bool mmgot = true;

/*
* We have to make sure to not race with the victim exit path
@@ -500,9 +501,16 @@ static bool __oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
* and delayed __mmput doesn't matter that much
*/
if (!mmget_not_zero(mm)) {
- up_read(&mm->mmap_sem);
trace_skip_task_reaping(tsk->pid);
- goto unlock_oom;
+ /*
+ * MMF_OOM_SKIP is set by exit_mmap when the OOM
+ * reaper can't work on the mm anymore.
+ */
+ if (test_and_set_bit(MMF_OOM_SKIP, &mm->flags)) {
+ up_read(&mm->mmap_sem);
+ goto unlock_oom;
+ }
+ mmgot = false;
}

trace_start_task_reaping(tsk->pid);
@@ -547,7 +555,8 @@ static bool __oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
* different context because we shouldn't risk we get stuck there and
* put the oom_reaper out of the way.
*/
- mmput_async(mm);
+ if (mmgot)
+ mmput_async(mm);
trace_finish_task_reaping(tsk->pid);
unlock_oom:
mutex_unlock(&oom_lock);

This will perform the same, as the set_bit in __mmput can then be
removed, and test_and_set_bit doesn't cost more than set_bit (at least
on x86; on other archs it requires an smp_mb too, but that's a
marginal difference): it's still a single atomic op.
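In userspace C11 terms, the difference the paragraph above alludes to
is only that the test-and-set variant returns the old value; both are
a single atomic RMW. A sketch, with hypothetical helper names and an
arbitrary bit index:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define MMF_OOM_SKIP 21		/* arbitrary bit index for the sketch */

/* set_bit() analogue: atomic OR, old value discarded */
static void set_bit_u(int nr, _Atomic unsigned long *flags)
{
	atomic_fetch_or(flags, 1UL << nr);
}

/* test_and_set_bit() analogue: the same RMW, but the returned old
 * value tells the caller whether the other side got there first */
static bool test_and_set_bit_u(int nr, _Atomic unsigned long *flags)
{
	return atomic_fetch_or(flags, 1UL << nr) & (1UL << nr);
}
```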

> > 4) how is it safe to overwrite a VM_FAULT_RETRY that returns without
> > mmap_sem and then the arch code will release the mmap_sem despite
> > it was already released by handle_mm_fault? Anonymous memory faults
> > aren't common to return VM_FAULT_RETRY but an userfault
> > can. Shouldn't there be a block that prevents overwriting if
> > VM_FAULT_RETRY is set below? (not only VM_FAULT_ERROR)
> >
> > if (unlikely((current->flags & PF_KTHREAD) && !(ret & VM_FAULT_ERROR)
> > && test_bit(MMF_UNSTABLE, &vma->vm_mm->flags)))
> > ret = VM_FAULT_SIGBUS;
>
> I am not sure I understand what you mean and how this is related to the
> patch?

It's not related to the patch, but it involves the OOM reaper: that
code only runs when MMF_UNSTABLE is set, and MMF_UNSTABLE is set only
by the OOM reaper. I was simply reading the OOM reaper code, following
up on what MMF_UNSTABLE does, and it rang a bell.
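The guard the quoted question proposes can be modeled in plain C. The
flag values and the function name below are invented for the sketch
(they are not the kernel's); the point is only that a result carrying
VM_FAULT_RETRY, where mmap_sem has already been given up, must not be
overwritten with VM_FAULT_SIGBUS:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical flag values, for the model only */
#define VM_FAULT_ERROR	0x1
#define VM_FAULT_RETRY	0x2
#define VM_FAULT_SIGBUS	0x4

/* Model of the hunk in question, with VM_FAULT_RETRY added to the
 * exclusion mask as the question suggests. */
unsigned int fixup_unstable(unsigned int ret, bool kthread, bool unstable)
{
	if (kthread && unstable && !(ret & (VM_FAULT_ERROR | VM_FAULT_RETRY)))
		ret = VM_FAULT_SIGBUS;
	return ret;
}
```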