Re: [BUGFIX for 2.6.36][RESEND][PATCH 1/2] oom: remove totalpage normalization from oom_badness()

From: David Rientjes
Date: Wed Sep 08 2010 - 04:24:52 EST


On Tue, 7 Sep 2010, David Rientjes wrote:

> Thus, my introduction of oom_score_adj causes no regression for real-world
> users of /proc/pid/oom_adj and allows users of cgroups and mempolicies a
> much more powerful interface to tune oom killing priority.
>

I want to elaborate on this point just a little more because I feel that
Andrew is in the unpleasant position of trying to judge whether the
introduction of this feature actually has a negative side-effect for
current users of oom_adj, and I want to stress that it doesn't.

The problem being reported here is that current users of oom_adj will now
experience altered semantics of the value if it doesn't lie at the
absolute extremes such as +15 or -16. The prior implementation would, in
its simplest form, do this:

badness_score = task->mm->total_vm << oom_adj

for a positive oom_adj and this:

badness_score = task->mm->total_vm >> -oom_adj

for a negative oom_adj. That implementation requires that whoever sets
oom_adj know the expected RAM usage of the task at the time of oom as
well as the system capacity. Otherwise, it would be impossible to judge
an appropriate value for the bitshift without knowing the total VM size
of the task and how its score would rank in comparison to other tasks.
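
To make the exponential behavior concrete, here is a purely illustrative
userspace sketch (the VM size and the wrapper program are hypothetical,
not kernel code) showing that each step of oom_adj doubles or halves the
score:

#include <stdio.h>

int main(void)
{
	/* hypothetical task size: 1,000,000 pages of total VM */
	unsigned long total_vm = 1000000UL;
	int oom_adj;

	for (oom_adj = -4; oom_adj <= 4; oom_adj++) {
		unsigned long score;

		/* old heuristic, roughly: shift the VM size by oom_adj */
		if (oom_adj >= 0)
			score = total_vm << oom_adj;
		else
			score = total_vm >> -oom_adj;

		printf("oom_adj %+d -> badness %lu\n", oom_adj, score);
	}
	return 0;
}

A value of +10, for example, already multiplies the score by 1024, which
is why intermediate values are so hard to reason about without knowing
the real VM sizes involved.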

The comparison is important because we use these scores to identify the
"memory hogging" task to kill: since the system is oom, the remainder of
system RAM that this application is not using may be consumed by another
task that we would therefore want to select instead.

It's my contention that nobody is currently using oom_adj in that way,
and anyone who is deserves a finer-grained tunable that doesn't operate
exponentially on the VM size, which leaves much to be desired. For these
users that aren't using cpusets, memcg, or mempolicies, oom_score_adj
allows the same kind of static value, one that scales linearly rather
than exponentially.
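
As a rough sketch of that difference, and only under my own simplified
assumptions (the function name and the unclamped arithmetic below are
illustrative, not the actual kernel implementation), the proportional
score grows linearly with memory usage and oom_score_adj shifts it by a
fixed amount:

#include <stdio.h>

/*
 * Illustrative only: the baseline score is the task's share of available
 * memory in units of 0.1%, and oom_score_adj shifts that share linearly.
 */
static long proportional_score(unsigned long task_pages,
			       unsigned long total_pages,
			       int oom_score_adj)
{
	long points = task_pages * 1000 / total_pages;	/* 0..1000 */

	return points + oom_score_adj;
}

int main(void)
{
	/* hypothetical: a task using 262144 of 2097152 total pages */
	printf("untuned:    %ld\n", proportional_score(262144, 2097152, 0));
	printf("tuned up:   %ld\n", proportional_score(262144, 2097152, 500));
	printf("tuned down: %ld\n", proportional_score(262144, 2097152, -100));
	return 0;
}

Doubling the task's usage doubles its baseline score here, instead of
having a bitshift amplify it exponentially.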

Current users of oom_adj do one of two things:

- polarize the value so that a task is always preferred (+15) or
always protected (-16) [or oom killing is disabled entirely with -17], or

- set up very rough approximations of oom killing priority such as
one task at +10 and another at +5.

The latter use is certainly in the very, very small minority, and the
values are arbitrary, implying that the former task should be selected
over the latter iff the latter has exploded in memory use such that it's
using much more than expected and is probably leaking memory.
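
For completeness, the polarizing case amounts to nothing more than a
write to procfs; a minimal sketch, with a made-up PID and only token
error handling:

#include <stdio.h>
#include <stdlib.h>

int main(void)
{
	/* illustrative PID; -17 is the legacy "never oom kill" value */
	FILE *f = fopen("/proc/1234/oom_adj", "w");

	if (!f) {
		perror("fopen");
		return EXIT_FAILURE;
	}
	fprintf(f, "%d\n", -17);
	fclose(f);
	return EXIT_SUCCESS;
}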

My contention that we are safe in proceeding with what is currently in
2.6.36-rc3 is based on the assumption that all users currently use oom_adj
in one of the two ways enumerated above. If that assumption isn't
accepted, then I believe the revert criterion is to show a user, any user,
of oom_adj who considers both the expected memory usage and the system
capacity of the application being tuned, and who does so in comparison to
other tasks on the system. Unless that can be shown, I do not believe the
revert criteria have been met in this case.

Furthermore, characterizing the above as a "bug" that affects end users
is inaccurate. The vast majority of Linux users do not use cpusets,
memcg, mempolicies, or memory hotplug. For those users, the proportional
scores used by oom_score_adj stay static because the system capacity
remains static. And oom_score_adj is exceptionally more powerful
precisely for the users this change is inaccurately described as
regressing: if they are tuning oom_adj based on the specific memory usage
of their applications and on the system capacity, they may continue to do
so with a rough linear approximation via oom_adj as they always did, or
they may use oom_score_adj with a _much_ higher resolution (1/1000th of
RAM) that scales linearly rather than exponentially.

For users of cpusets or memcg, an existing oom_adj value will yield a
rough approximation of the intended priority (and always an exact
equivalent in the high-probability case where the task is polarized with
+15 or -17) while the old interface is being used. The badness score
_will_ change if the set of cpuset mems or the memcg limit changes, which
is new behavior. From the oom killer's perspective, this is exactly
equivalent to hot-adding or hot-removing memory on a system with memory
hotplug: a badness score with a certain value may no longer carry the
same kill priority because more or less memory is available. Consider
the unconstrained, system-wide oom case: if we hot-added memory and are
now oom, a task whose memory usage hasn't changed may no longer be the
first task killed, because another task's usage may now push its badness
score above the former's. Likewise, if we hot-removed memory and are now
oom, a task whose memory usage hasn't changed may now be selected first,
because all other tasks' usage may now be smaller than its own.
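
To put purely illustrative numbers on that: a task using 1 GB of a 4 GB
machine holds a quarter of RAM; hot-add another 4 GB and, with no change
in its own usage, it holds only an eighth, so a neighbour that has grown
in the meantime may now be selected first:

1 GB of 4 GB total -> 1024/4096 = 1/4 of RAM
1 GB of 8 GB total -> 1024/8192 = 1/8 of RAM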

That's exactly the behavior of oom_score_adj when constrained to a
cpuset or memcg. Think of either as a virtualized system with a set of
resources and an aggregate of tasks competing for those resources.
Priorities will change as memory is allowed or restricted, just as they
always have with memory hotplug, but we now allow users to define the
proportion of that working set they may consume instead of a static
value. Static oom_adj scores never work appropriately in a dynamic
environment because we don't know the capacity when we set the value
(remember the two prerequisites for using oom_adj: the expected RAM usage
of the application and the memory capacity available to it).

For the users of mempolicies, the entire oom killer rewrite has changed
how those ooms are handled: prior to the rewrite, the oom killer would
simply kill current. Now, the tasklist is iterated whenever the
oom_kill_allocating_task sysctl is not enabled, and the badness scores
are used to select a task. The prior behavior of oom_adj in these oom
contexts is therefore unrelated to this discussion.

This explains the power and necessity of oom_score_adj for users who use
cpusets, memcg, or mempolicies. Those environments are dynamic and we
_can't_ expect oom_adj to be rewritten every time a task changes
cgroups, a cpuset mem is added, a memcg limit is reduced, or a node is
added to a
mempolicy. We _can_ expect the admin to know the priority of killing jobs
relative to others competing for the same set of now fully depleted
memory if they are using the tunable.

Given the above, it's not possible to meet the revert criteria. The real
question to be asking in this case is not whether we need to revert
oom_score_adj, but rather whether we need dual interfaces to exist:
oom_adj and oom_score_adj. I believe only a single interface should be
available in the kernel, and since oom_score_adj is much more powerful
than oom_adj, acts at a higher resolution, respects the dynamic nature of
cgroups, provides a rough approximation for existing users of oom_adj,
and an exact equivalent for polarizing users of oom_adj, it is the one
that should remain while oom_adj is deprecated.