Re: Linux killed Kenny, bastard!

From: David Rientjes
Date: Tue Jan 13 2009 - 14:36:55 EST


On Tue, 13 Jan 2009, Evgeniy Polyakov wrote:

> Don't you notice how many 'who' were placed and only single 'user space'
> answer? Becasue it is not an answer, it is a theoretical POV, which does
> not really work in practice, since it is way too unconvenient and
> error-prone, and actually it does not work when needed, since because of
> its complexity something will be missed. I've just talked with the
> admins who originally requested 'kill-by-name' feature why they did not
> work with /proc/.../oom_adj, and got a nice answer: we tries, but
> likely something went wrong and it did not work the way we wanted.
>
> There is no way to know that adjustment is correct, that everything was
> uptodate when oom happend, that nothing was forgotten and practice shows
> that there are always such problems and invalid tasks are killed.
>
> When you put a name you do know that it works, since it is only single
> place to be updated and no need to bother with ugly tools or changes
> especially to handle short-living processes.
>

The goal of the oom killer is to kill a rogue memory hogging task, which
will lead to future memory freeing once the task dies, and allow the
system or container to resume normal operation.

You're not realizing the power of /proc/pid/oom_adj: it allows you to tune
the badness scoring so that YOU, the user, may determine what the
definition of 'rogue' is on a task-by-task basis.

Your patch simply allows users to specify a task by name that will always
be killed first when the oom killer is invoked. That's terribly
insufficient if another task uses an excessive amount of memory that you
didn't expect; a rogue task may be leaking memory and the task you've
identified by name with your patch is repeatedly forked and killed when
the rogue task goes untouched.

With oom_adj scores, you can easily specify at what point each task should
be considered rogue. You can elevate the oom_adj score for those you have
a preference to kill and reduce the oom_adj score for those that you'd
prefer being deferred _unless_ they get sufficiently out of hand.

Your patch presents a shortcut where the entire badness scoring (and,
thus, all oom_adj scores) is ignored if the named task exists. That not
only has syncronization issues, but also can cause the kernel to loop
forever in killing a task by the same name without ever freeing memory for
anything else.

Additionally, your patch completely breaks cpuset oom killing since
candidacy is determined in badness() because a task may have allocated
non-migrated memory elsewhere before being moved to a different cpuset.
Your oom_victim_name task may exist globally, but will always be
identified for oom kill even when the oom exists exclusively in a disjoint
cpuset. That does _not_ lead to future memory freeing that current can
use, and if the parent of the killed task decides to immediately fork
another instance, this cpuset will be completely livelocked.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/