Re: [rfc patch] mm, oom: fix unnecessary killing of additional processes

From: David Rientjes
Date: Fri May 25 2018 - 15:44:34 EST


On Fri, 25 May 2018, Tetsuo Handa wrote:

> > The oom reaper ensures forward progress by setting MMF_OOM_SKIP itself if
> > it cannot reap an mm. This can happen for a variety of reasons,
> > including:
> >
> > - the inability to grab mm->mmap_sem in a sufficient amount of time,
> >
> > - when the mm has blockable mmu notifiers that could cause the oom reaper
> > to stall indefinitely,
> >
> > but we can also add a third when the oom reaper can "reap" an mm but doing
> > so is unlikely to free any amount of memory:
> >
> > - when the mm's memory is fully mlocked.
>
> - when the mm's memory is fully mlocked (needs privilege) or
> fully shared (does not need privilege)
>

Good point, that is another way that unnecessary oom killing can occur
because the oom reaper sets MMF_OOM_SKIP far too early. I can make the
change to the commit message.

Also, I noticed in my patch that oom_reap_task() should be doing
list_add_tail() rather than list_add() to enqueue the mm for reaping
again.

> > This is the same issue where the exit path sets MMF_OOM_SKIP before
> > unmapping memory and additional processes can be chosen unnecessarily
> > because the oom killer is racing with exit_mmap().
> >
> > We can't simply defer setting MMF_OOM_SKIP, however, because if there is
> > a true oom livelock in progress, it never gets set and no additional
> > killing is possible.
> >
> > To fix this, this patch introduces a per-mm reaping timeout, initially set
> > at 10s. It requires that the oom reaper's list becomes a properly linked
> > list so that other mm's may be reaped while waiting for an mm's timeout to
> > expire.
>
> I already proposed more simpler one at https://patchwork.kernel.org/patch/9877991/ .
>

It's a similar idea, and I'm glad that we agree that some kind of per-mm
delay is required to avoid this problem. I think yours is simpler, but
consider the other two changes in my patch:

- in the normal exit path, absent any timeout for the mm, we only set
MMF_OOM_SKIP after free_pgtables() when it is known we will not free
any additional memory, which can also cause unnecessary oom killing
because the oom killer races with free_pgtables(), and

- the oom reaper now operates over all concurrent victims instead of
repeatedly trying to take mm->mmap_sem of the first victim, sleeping
many times, retrying, giving up, and moving on the next victim.
Allowing the oom reaper to iterate through all victims can allow
memory freeing such that an allocator may be able to drop mm->mmap_sem.

In fact, with my patch, I don't know of any condition where we kill
additional processes unnecessarily *unless* the victim cannot be oom
reaped or complete memory freeing in the exit path within 10 seconds.
Given how rare oom livelock appears in practice, I think the 10 seconds is
justified because right now it is _trivial_ to oom kill many victims
completely unnecessarily.