Re: [PATCH] copy over oom_adj value at fork time

From: David Rientjes
Date: Fri Jul 17 2009 - 05:34:24 EST


On Thu, 16 Jul 2009, Paul Menage wrote:

> How about if instead of having the oom_adj be per-mm, we kept an array
> of counters in the mm, tracking how many users were at each oom_adj
> level; the OOM killer could then use the level of the mm's highest
> oom_adj user when deciding how to calculate the badness of a thread
> using that mm.
>

That would lead to the same inconsistencies that we had before: consider
two tasks sharing the same mm_struct, taskA and taskB. It was previously
possible for taskA to have an oom_adj value of -15 and taskB to have an
oom_adj value of +15. This would cause /proc/pid/oom_score to be very
small for taskA and very large for taskB. With your proposal, taskB's
badness score would implicitly be very small, yet it is reported to
userspace as very high.

The only way to work around that is to use the highest oom_adj user for
the mm_struct from the array when reporting /proc/pid/oom_score, as well.
But that would lead to /proc/pid/oom_adj not affecting oom_score at all,
which isn't consistent.

I think you'll find that having the oom_adj value be purely an attribute
of the memory it represents is the cleanest solution, since it most
accurately describes how the oom killer interprets it when deciding which
task to kill.
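
As a rough illustration of what "an attribute of the memory" means, here
is a toy user-space model. It is not kernel code, and every name in it
(toy_mm, toy_task, toy_set_oom_adj, toy_get_oom_adj) is hypothetical. It
only shows that when the value is stored in the shared memory descriptor,
every task using that memory reads back the same value.

```c
/* Toy stand-ins for the kernel structures; all names are hypothetical. */
struct toy_mm {
    int oom_adj;            /* one value for the whole address space */
};

struct toy_task {
    struct toy_mm *mm;      /* tasks sharing memory point at the same toy_mm */
};

/* Models a write to /proc/<pid>/oom_adj: the value is stored in the mm. */
static void toy_set_oom_adj(struct toy_task *t, int adj)
{
    t->mm->oom_adj = adj;
}

/* Models a read of /proc/<pid>/oom_adj: the value is fetched from the mm. */
static int toy_get_oom_adj(const struct toy_task *t)
{
    return t->mm->oom_adj;
}
```

With taskA and taskB both pointing at one toy_mm, a call to
toy_set_oom_adj() on taskA with -15 makes toy_get_oom_adj() on taskB
return -15 as well, so the reported value and the value the oom killer
would use cannot diverge in this model.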

> That would preserve the previous semantics of letting a spawned child
> inherit a per-thread oom_adj value, while avoiding the specific
> problem of the OOM killer getting livelocked (that David's patch
> originally addressed) and the more general case of the inconsistency
> in determining the oom_adj level of an mm depending on which thread
> you look at.
>

Right, it's still a little strange that changing /proc/pid/oom_adj for one
thread will change it for another if they share memory, even if they are
in different thread groups, but that shouldn't come as a surprise if the
admin understands that the oom killer must kill _all_ threads sharing
memory with the target before any memory can be freed.

The inheritance issue should be fixed by Rik's patch, with the exception
of vfork -> change /proc/pid-of-child/oom_adj -> execve. If scripts were
written to do that with the old behavior, they'll have to be adjusted to
change oom_adj _after_ the execve to avoid changing the oom_adj value of
the vfork parent. If there is no execve, or we're just doing CLONE_VM,
then the child shares memory with the parent and, thus, their oom_adj
values will be the same.
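
The ordering problem can be sketched with another toy user-space model.
Again, this is not kernel code and all names (demo_mm, demo_task,
demo_vfork, demo_execve, demo_set_oom_adj) are hypothetical. It assumes
that the new address space created at execve starts with a copy of the
old oom_adj value; that detail is an assumption of the sketch, not
something stated above.

```c
/* Toy model of per-mm oom_adj across vfork and execve; names are hypothetical. */
struct demo_mm {
    int oom_adj;
};

struct demo_task {
    struct demo_mm *mm;
};

/* vfork (or CLONE_VM): the child borrows the parent's address space. */
static struct demo_task demo_vfork(const struct demo_task *parent)
{
    struct demo_task child = { parent->mm };
    return child;
}

/*
 * execve: the task switches to a fresh address space supplied by the caller.
 * Assumption of this sketch: the fresh mm starts with a copy of the old value.
 */
static void demo_execve(struct demo_task *t, struct demo_mm *fresh)
{
    fresh->oom_adj = t->mm->oom_adj;
    t->mm = fresh;
}

/* Models a write to /proc/<pid>/oom_adj: the value is stored in the mm. */
static void demo_set_oom_adj(struct demo_task *t, int adj)
{
    t->mm->oom_adj = adj;
}
```

In this model, calling demo_set_oom_adj() on the child between
demo_vfork() and demo_execve() also changes the parent's value, because
both tasks still point at the same demo_mm. Calling it after
demo_execve() only touches the child's fresh demo_mm.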