Re: can't oom-kill zap the victim's memory?

From: David Rientjes
Date: Thu Sep 24 2015 - 17:15:41 EST


On Wed, 23 Sep 2015, Michal Hocko wrote:

> I am still not sure how you want to implement that kernel thread but I
> am quite skeptical it would be very much useful because all the current
> allocations which end up in the OOM killer path cannot simply back off
> and drop the locks with the current allocator semantic. So they will
> be sitting on top of unknown pile of locks whether you do an additional
> reclaim (unmap the anon memory) in the direct OOM context or looping
> in the allocator and waiting for kthread/workqueue to do its work. The
> only argument that I can see is the stack usage but I haven't seen stack
> overflows in the OOM path AFAIR.
>

Which locks are you specifically interested in? We have already discussed
the usefulness of killing all threads on the system sharing the same ->mm,
meaning all threads that are either holding or want to hold mm->mmap_sem
will be able to allocate into memory reserves. Any allocator holding
down_write(&mm->mmap_sem) should be able to allocate and drop its lock.
(Are you concerned about MAP_POPULATE?)

> > Finally. Whatever we do, we need to change oom_kill_process() first,
> > and I think we should do this regardless. The "Kill all user processes
> > sharing victim->mm" logic looks wrong and suboptimal/overcomplicated.
> > I'll try to make some patches tomorrow if I have time...
>
> That would be appreciated. I do not like that part either. At least we
> shouldn't go over the whole list when we have a good chance that the mm
> is not shared with other processes.
>

Heh, it's actually imperative to avoid livelocking based on mm->mmap_sem,
it's the reason the code exists. Any optimizations to that is certainly
welcome, but we definitely need to send SIGKILL to all threads sharing the
mm to make forward progress, otherwise we are going back to pre-2008
livelocks.

> Yes I am not really sure why oom_score_adj is not per-mm and we are
> doing that per signal struct to be honest. It doesn't make much sense as
> the mm_struct is the primary source of information for the oom victim
> selection. And the fact that mm might be shared withtout sharing signals
> make it double the reason to have it in mm.
>
> It seems David has already tried that 2ff05b2b4eac ("oom: move oom_adj
> value from task_struct to mm_struct") but it was later reverted by
> 0753ba01e126 ("mm: revert "oom: move oom_adj value""). I do not agree
> with the reasoning there because vfork is documented to have undefined
> behavior
> "
> if the process created by vfork() either modifies any data other
> than a variable of type pid_t used to store the return value
> from vfork(), or returns from the function in which vfork() was
> called, or calls any other function before successfully calling
> _exit(2) or one of the exec(3) family of functions.
> "
> Maybe we can revisit this... It would make the whole semantic much more
> straightforward. The current situation when you kill a task which might
> share the mm with OOM unkillable task is clearly suboptimal and
> confusing.
>

How do you reconcile this with commit 28b83c5193e7 ("oom: move oom_adj
value from task_struct to signal_struct")? We also must appreciate the
real-world usecase for an oom disabled process doing fork(), setting
/proc/child/oom_score_adj to non-disabled, and exec().
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/