Re: [PATCH 3/3 for 2.6.38] oom: oom_kill_process: fix the child_pointslogic

From: David Rientjes
Date: Mon Mar 14 2011 - 16:41:56 EST


On Mon, 14 Mar 2011, Oleg Nesterov wrote:

> oom_kill_process() starts with victim_points == 0. This means that
> (most likely) any child has more points and can be killed erroneously.
>
> Also, "children has a different mm" doesn't match the reality, we
> should check child->mm != t->mm. This check is not exactly correct
> if t->mm == NULL but this doesn't really matter, oom_kill_task()
> will kill them anyway.
>
> Note: "Kill all processes sharing p->mm" in oom_kill_task() is wrong
> too.
>

There're two issues you're addressing in this patch. It only kills a
child in place of its selected parent when:

- the child has a higher badness score, and

- it has a different ->mm.

In the former case, NACK, we always want to sacrifice children regardless
of their badness score (as long as it is non-zero) if it has a separate
->mm in place of its parent, otherwise webservers will be killed instead
of one of their children serving a client, sshd could be killed instead of
bash, etc. The behavior of the oom killer has always been to try to kill
a child with its own ->mm first to avoid losing a large amount of work
being done or unnecessarily killing a job scheduler, for example, when
sacrificing a child would be satisfactory. It'll kill additional tasks,
and perhaps even the parent later if it has no more children, if the oom
condition persists.

In the latter case, I agree, we should be testing if the child has a
different ->mm before sacrificing it for its parent as the comment
indicates it will. I proposed that exact change in "oom: avoid deferring
oom killer if exiting task is being traced" posted to -mm a couple days
ago.

> Signed-off-by: Oleg Nesterov <oleg@xxxxxxxxxx>
> ---
>
> mm/oom_kill.c | 3 ++-
> 1 file changed, 2 insertions(+), 1 deletion(-)
>
> --- 38/mm/oom_kill.c~3_fix_kill_chld 2011-03-14 18:52:39.000000000 +0100
> +++ 38/mm/oom_kill.c 2011-03-14 19:36:01.000000000 +0100
> @@ -459,10 +459,10 @@ static int oom_kill_process(struct task_
> struct mem_cgroup *mem, nodemask_t *nodemask,
> const char *message)
> {
> - struct task_struct *victim = p;
> + struct task_struct *victim;
> struct task_struct *child;
> - struct task_struct *t = p;
> - unsigned int victim_points = 0;
> + struct task_struct *t;
> + unsigned int victim_points;
>
> if (printk_ratelimit())
> dump_header(p, gfp_mask, order, mem, nodemask);
> @@ -488,10 +488,15 @@ static int oom_kill_process(struct task_
> * parent. This attempts to lose the minimal amount of work done while
> * still freeing memory.
> */
> + victim_points = oom_badness(p, mem, nodemask, totalpages);
> + victim = p;
> + t = p;
> do {
> list_for_each_entry(child, &t->children, sibling) {
> unsigned int child_points;
>
> + if (child->mm == t->mm)
> + continue;
> /*
> * oom_badness() returns 0 if the thread is unkillable
> */
>
>
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/