Re: [patch] oom: give current access to memory reserves if it hasbeen killed

From: David Rientjes
Date: Wed Mar 31 2010 - 17:07:31 EST


On Wed, 31 Mar 2010, Oleg Nesterov wrote:

> On 03/30, David Rientjes wrote:
> >
> > On Tue, 30 Mar 2010, Oleg Nesterov wrote:
> >
> > > Note that __oom_kill_task() does force_sig(SIGKILL) which assumes that
> > > ->sighand != NULL. This is not true if out_of_memory() is called after
> > > current has already passed exit_notify().
> >
> > We have an even bigger problem if current is in the oom killer at
> > exit_notify() since it has already detached its ->mm in exit_mm() :)
>
> Can't understand... I thought that in theory even kmalloc(1) can trigger
> oom.
>

__oom_kill_task() cannot be called on a task without an ->mm.

> > > IOW, unless I missed something, it is very easy to hide the process
> > > from oom-kill:
> > >
> > > int main()
> > > {
> > > pthread_create(memory_hog_func);
> > > syscall(__NR_exit);
> > > }
> > >
> >
> > The check for !p->mm was moved in the -mm tree (and the oom killer was
> > entirely rewritten in that tree, so I encourage you to work off of it
> > instead
>
> OK, but I guess this !p->mm check is still wrong for the same reason.
> In fact I do not understand why it is needed in select_bad_process()
> right before oom_badness() which checks ->mm too (and this check is
> equally wrong).
>

It prevents kthreads from being killed. We already identify tasks that
are in the exit path with PF_EXITING in select_bad_process() and chosen to
make the oom killer a no-op when it's not current so it can exit and free
its memory. If it is current, then we're ooming in the exit path and we
need to oom kill it so that it gets access to memory reserves so its no
longer blocking.

> > so if the oom killer finds an already exiting task,
> > it will become a no-op since it should eventually free memory and avoids a
> > needless oom kill.
>
> No, afaics, And this reminds that I already complained about this
> PF_EXITING check.
>
> Once again, p is the group leader. It can be dead (no ->mm, PF_EXITING
> is set) but it can have sub-threads. This means, unless I missed something,
> any user can trivially disable select_bad_process() forever.
>

The task is in the process of exiting and will do so if its not current,
otherwise it will get access to memory reserves since we're obviously oom
in the exit path. Thus, we'll be freeing that memory soon or recalling
the oom killer to kill additional tasks once those children have been
reparented (or one of its children was sacrificed).

>
> Well. Looks like, -mm has a lot of changes in oom_kill.c. Perhaps it
> would be better to fix these mt bugs first...
>
> Say, oom_forkbomb_penalty() does list_for_each_entry(tsk->children).
> Again, this is not right even if we forget about !child->mm check.
> This list_for_each_entry() can only see the processes forked by the
> main thread.
>

That's the intention.

> Likewise, oom_kill_process()->list_for_each_entry() is not right too.
>

Why?

> Hmm. Why oom_forkbomb_penalty() does thread_group_cputime() under
> task_lock() ? It seems, ->alloc_lock() is only needed for get_mm_rss().
>

Right, but we need to ensure that the check for !child->mm || child->mm ==
tsk->mm fails before adding in get_mm_rss(child->mm). It can race and
detach its mm prior to the dereference. It would be possible to move the
thread_group_cputime() out of this critical section, but I felt it was
better to do filter all tasks with child->mm == tsk->mm first before
unnecessarily finding the cputime for them.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/