Re: can't oom-kill zap the victim's memory?

From: David Rientjes
Date: Mon Sep 28 2015 - 18:24:14 EST


On Fri, 25 Sep 2015, Michal Hocko wrote:

> > > I am still not sure how you want to implement that kernel thread but I
> > > am quite skeptical it would be very much useful because all the current
> > > allocations which end up in the OOM killer path cannot simply back off
> > > and drop the locks with the current allocator semantic. So they will
> > > be sitting on top of unknown pile of locks whether you do an additional
> > > reclaim (unmap the anon memory) in the direct OOM context or looping
> > > in the allocator and waiting for kthread/workqueue to do its work. The
> > > only argument that I can see is the stack usage but I haven't seen stack
> > > overflows in the OOM path AFAIR.
> > >
> >
> > Which locks are you specifically interested in?
>
> Any locks they were holding before they entered the page allocator (e.g.
> i_mutex is the easiest one to trigger from the userspace but mmap_sem
> might be involved as well because we are doing kmalloc(GFP_KERNEL) with
> mmap_sem held for write). Those would be locked until the page allocator
> returns, which with the current semantic might be _never_.
>

I agree that i_mutex seems to be one of the most common offenders.
However, I'm not sure I understand why holding it while trying to allocate
infinitely for an order-0 allocation is problematic wrt the proposed
kthread. The kthread itself need only take mmap_sem for read. If all
threads sharing the mm with a victim have been SIGKILL'd, they should get
TIF_MEMDIE set when reclaim fails and be able to allocate so that they can
drop mmap_sem. We must ensure that any holder of mmap_sem cannot quickly
deplete memory reserves without properly checking for
fatal_signal_pending().
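
As a rough illustration of the behaviour described above, here is a minimal
userspace sketch, not kernel code. Every name in it (toy_task, toy_zone,
toy_alloc_page, toy_alloc_slowpath) is hypothetical; the memdie field only
models TIF_MEMDIE, and the min_watermark field only models the memory
reserves. It assumes reclaim has already failed by the time the slow path
runs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy userspace model; none of these names are kernel APIs. */
struct toy_task {
	bool fatal_signal_pending;	/* models fatal_signal_pending(current) */
	bool memdie;			/* models TIF_MEMDIE */
};

struct toy_zone {
	long free_pages;	/* pages currently free */
	long min_watermark;	/* pages held back as the reserve */
};

/* Take one page. Only a memdie task may dip below the watermark. */
static bool toy_alloc_page(struct toy_zone *z, const struct toy_task *t)
{
	long floor;

	assert(z != NULL && t != NULL);
	floor = t->memdie ? 0 : z->min_watermark;
	if (z->free_pages <= floor)
		return false;
	z->free_pages--;
	return true;
}

/* Slow path: reclaim is assumed to have failed before we get here. */
static bool toy_alloc_slowpath(struct toy_zone *z, struct toy_task *t)
{
	assert(z != NULL && t != NULL);
	if (toy_alloc_page(z, t))
		return true;
	if (t->fatal_signal_pending) {
		/* A killed task gets reserve access so it can exit. */
		t->memdie = true;
		return toy_alloc_page(z, t);
	}
	return false;	/* a real allocator would keep looping here */
}
```

With free_pages equal to min_watermark, a task without a pending fatal
signal gets nothing from toy_alloc_slowpath(), while a killed task is
marked memdie and takes a page from the reserve, which is what would let
it finish and drop mmap_sem.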

> > We have already discussed
> > the usefulness of killing all threads on the system sharing the same ->mm,
> > meaning all threads that are either holding or want to hold mm->mmap_sem
> > will be able to allocate into memory reserves. Any allocator holding
> > down_write(&mm->mmap_sem) should be able to allocate and drop its lock.
> > (Are you concerned about MAP_POPULATE?)
>
> I am not sure I understand. We would have to fail the request in order
> the context which requested the memory could drop the lock. Are we
> talking about the same thing here?
>

Not fail the request; they should be able to allocate from memory reserves
when TIF_MEMDIE gets set. This would require that threads in all gfp
contexts are able to get TIF_MEMDIE set without an explicit call to
out_of_memory() for !__GFP_FS.

> > Heh, it's actually imperative to avoid livelocking based on mm->mmap_sem,
> > it's the reason the code exists. Any optimizations to that is certainly
> > welcome, but we definitely need to send SIGKILL to all threads sharing the
> > mm to make forward progress, otherwise we are going back to pre-2008
> > livelocks.
>
> Yes but mm is not shared between processes most of the time. CLONE_VM
> without CLONE_THREAD is more a corner case yet we have to crawl all the
> task_structs for _each_ OOM killer invocation. Yes this is an extreme
> slow path but still might take quite some unnecessarily time.
>

It must handle the case you describe, killing other processes that share
the ->mm; otherwise we have an mm->mmap_sem livelock. We are not concerned
about iterating over all task_structs in the oom killer as a pain point;
users for whom that is a problem should already be using
oom_kill_allocating_task, which is why it was introduced.
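
The walk over all processes can be pictured with a small userspace sketch.
This is a toy model under stated assumptions, not the oom killer's code:
toy_mm, toy_proc and toy_kill_mm_sharers are hypothetical names, a flat
array stands in for the task list, and sigkill_sent only models delivering
SIGKILL.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy userspace model; none of these names are kernel APIs. */
struct toy_mm {
	int id;
};

struct toy_proc {
	struct toy_mm *mm;	/* NULL models a task without an mm */
	bool sigkill_sent;	/* models SIGKILL having been sent */
};

/*
 * Mark every process that shares the victim's mm as killed and
 * return how many were newly marked. The full scan is the cost
 * being discussed: it is O(number of processes) per invocation.
 */
static size_t toy_kill_mm_sharers(struct toy_proc *procs, size_t n,
				  const struct toy_proc *victim)
{
	size_t killed = 0;
	size_t i;

	assert(procs != NULL && victim != NULL);
	if (victim->mm == NULL)
		return 0;

	for (i = 0; i < n; i++) {
		if (procs[i].mm != victim->mm)
			continue;
		if (!procs[i].sigkill_sent) {
			procs[i].sigkill_sent = true;
			killed++;
		}
	}
	return killed;
}
```

In this model, processes created with CLONE_VM but without CLONE_THREAD
show up as separate toy_proc entries pointing at the same toy_mm, so a
single pass marks all of them and none is left holding mmap_sem unkilled.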