Re: [PATCH] mm, oom: allow oom reaper to race with exit_mmap

From: Andrea Arcangeli
Date: Wed Jul 26 2017 - 12:29:21 EST


On Wed, Jul 26, 2017 at 07:45:57AM +0200, Michal Hocko wrote:
> On Tue 25-07-17 21:19:52, Andrea Arcangeli wrote:
> > On Tue, Jul 25, 2017 at 06:04:00PM +0200, Michal Hocko wrote:
> > > - down_write(&mm->mmap_sem);
> > > + if (tsk_is_oom_victim(current))
> > > + down_write(&mm->mmap_sem);
> > > free_pgtables(&tlb, vma, FIRST_USER_ADDRESS, USER_PGTABLES_CEILING);
> > > tlb_finish_mmu(&tlb, 0, -1);
> > >
> > > @@ -3012,7 +3014,8 @@ void exit_mmap(struct mm_struct *mm)
> > > }
> > > mm->mmap = NULL;
> > > vm_unacct_memory(nr_accounted);
> > > - up_write(&mm->mmap_sem);
> > > + if (tsk_is_oom_victim(current))
> > > + up_write(&mm->mmap_sem);
> >
> > How is this possibly safe? mark_oom_victim can run while exit_mmap is
> > running.
>
> I believe it cannot. We always call mark_oom_victim (on !current) with
> task_lock held and check task->mm != NULL and we call do_exit->mmput after
> mm is set to NULL under the same lock.

Holding the mmap_sem for writing, and setting mm->mmap to NULL to
filter out the tasks that already released the mmap_sem for writing
after free_pgtables, still look unnecessary to solve this.

Using MMF_OOM_SKIP as the flag had the side effect of oom_badness()
skipping it, but we can use the same tsk_is_oom_victim check instead
and rely on the locking in mark_oom_victim you pointed out above,
rather than on the test_and_set_bit of my patch, because current->mm
is already NULL at that point.
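
To spell out the locking argument as I read it, here is a userspace toy
model. It is only a sketch: a pthread mutex stands in for task_lock, and
every model_* name is made up for illustration, none of this is the
kernel code. It assumes marking happens only under the lock with an
mm != NULL check, and that the exit path clears mm under the same lock
before the final mmput.

```c
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* Userspace model only; these are not the kernel structures. */
struct model_mm { int users; };

struct model_task {
    pthread_mutex_t lock;   /* stands in for task_lock() */
    struct model_mm *mm;    /* stands in for task->mm */
    bool oom_victim;        /* stands in for tsk_is_oom_victim() */
};

/* Mark the task only if it still owns an mm, checked under the lock. */
static bool model_mark_oom_victim(struct model_task *t)
{
    bool marked = false;

    pthread_mutex_lock(&t->lock);
    if (t->mm != NULL) {
        t->oom_victim = true;
        marked = true;
    }
    pthread_mutex_unlock(&t->lock);
    return marked;
}

/* Exit path: detach the mm under the lock and hand it to the later mmput. */
static struct model_mm *model_exit_mm(struct model_task *t)
{
    struct model_mm *mm;

    pthread_mutex_lock(&t->lock);
    mm = t->mm;
    t->mm = NULL;
    pthread_mutex_unlock(&t->lock);
    return mm;
}

/* Returns 0 when both orderings behave as the argument predicts. */
static int model_demo(void)
{
    struct model_mm mm = { 1 };
    struct model_task early = { PTHREAD_MUTEX_INITIALIZER, &mm, false };
    struct model_task late = { PTHREAD_MUTEX_INITIALIZER, &mm, false };

    /* Marked before exit: the flag is already set when the mm is torn down. */
    if (!model_mark_oom_victim(&early) || model_exit_mm(&early) != &mm ||
        !early.oom_victim)
        return 1;

    /* Exit first: marking fails, so the flag cannot flip afterwards. */
    if (model_exit_mm(&late) != &mm || model_mark_oom_victim(&late) ||
        late.oom_victim)
        return 2;

    return 0;
}
```

In this model the flag is stable once the mm has been detached, which is
the property a tsk_is_oom_victim check in exit_mmap would depend on.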

In light of the above, the problem now is: because current->mm is
already NULL by the time mmput is called, how can you start
oom_reap_task on a process with current->mm NULL that called the last
mmput and is blocked in exit_aio? It looks like no false positive can
get fixed until this
is solved first because

Isn't this enough? If it is, it avoids other modifications to the
exit_mmap runtime that look unnecessary: the mm->mmap = NULL assignment,
replaced by MMF_OOM_SKIP which has to be set anyway by __mmput later,
and one unnecessary branch to call up_write.