Re: [v8 0/4] cgroup-aware OOM killer

From: David Rientjes
Date: Wed Sep 13 2017 - 16:46:16 EST


On Wed, 13 Sep 2017, Michal Hocko wrote:

> > > This patchset makes the OOM killer cgroup-aware.
> > >
> > > v8:
> > > - Do not kill tasks with OOM_SCORE_ADJ -1000
> > > - Make the whole thing opt-in with cgroup mount option control
> > > - Drop oom_priority for further discussions
> >
> > Nack, we specifically require oom_priority for this to function correctly,
> > otherwise we cannot prefer to kill from low priority leaf memcgs as
> > required.
>
> While I understand that your use case might require priorities, I do not
> think this missing part is a reason to nack the cgroup-based selection
> and kill-all parts. This can be done on top. The only important part
> right now is the current selection semantic (only leaf memcgs vs. size
> of the hierarchy). I strongly believe that comparing only leaf memcgs
> is more straightforward and it doesn't lead to unexpected results as
> mentioned before (kill a small memcg which is a part of the larger
> sub-hierarchy).
>

The problem is that we cannot enable the cgroup-aware oom killer and
oom_group behavior because, without oom priorities, we have no ability to
influence which cgroup it chooses. The patchset does two things: it
provides more fairness amongst cgroups by selecting based on cumulative
usage rather than a single large process (good!), and it effectively
removes all userspace control of oom selection (bad). We want the former,
but it needs to be coupled with support for protecting vital cgroups
regardless of their usage.
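
To make the difference concrete, here is a minimal userspace sketch of the
two selection policies. It is not the patchset's code: the LeafMemcg type,
the function names, and the convention that a lower oom_priority value is
killed first are all assumptions made for illustration only.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class LeafMemcg:
    """Hypothetical stand-in for a leaf memcg; not a kernel structure."""
    name: str
    usage_bytes: int       # cumulative usage of all tasks in the memcg
    oom_priority: int = 0  # assumed knob: lower value is killed first


def pick_by_usage(leaves: List[LeafMemcg]) -> Optional[LeafMemcg]:
    """Usage-only selection: the largest leaf memcg is the victim."""
    return max(leaves, key=lambda m: m.usage_bytes, default=None)


def pick_by_priority(leaves: List[LeafMemcg]) -> Optional[LeafMemcg]:
    """Priority first, usage second: only the lowest-priority leaf
    memcgs are candidates, and the largest of those is the victim."""
    if not leaves:
        return None
    lowest = min(m.oom_priority for m in leaves)
    candidates = [m for m in leaves if m.oom_priority == lowest]
    return pick_by_usage(candidates)


if __name__ == "__main__":
    leaves = [
        LeafMemcg("vital-db", 8 << 30, oom_priority=10),
        LeafMemcg("batch", 1 << 30, oom_priority=0),
    ]
    print(pick_by_usage(leaves).name)
    print(pick_by_priority(leaves).name)
```

With usage-only selection the large but vital memcg is chosen; with a
priority knob userspace can steer the kill to the batch memcg regardless
of its size.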

It is certainly possible to add oom priorities on top before it is merged,
but I don't see why they aren't part of the patchset. We need them before
it's merged so that users don't have to play with /proc/pid/oom_score_adj
to prevent any killing in the most preferable memcg when they could have
simply changed the oom priority.
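
For reference, the per-process workaround looks roughly like the sketch
below. It assumes a cgroup v2 directory containing a cgroup.procs file and
the usual /proc/<pid>/oom_score_adj range of -1000 to 1000. The function
name and the proc_root parameter are hypothetical, the latter added only so
the sketch can be exercised outside a real /proc.

```python
import os
from typing import List


def set_oom_score_adj_for_cgroup(cgroup_dir: str, value: int,
                                 proc_root: str = "/proc") -> List[int]:
    """Write `value` to oom_score_adj for every PID currently listed in
    cgroup_dir/cgroup.procs and return the PIDs that were updated."""
    if not -1000 <= value <= 1000:
        raise ValueError("oom_score_adj must be within [-1000, 1000]")

    with open(os.path.join(cgroup_dir, "cgroup.procs")) as f:
        pids = [int(line) for line in f if line.strip()]

    updated = []
    for pid in pids:
        path = os.path.join(proc_root, str(pid), "oom_score_adj")
        try:
            with open(path, "w") as f:
                f.write(str(value))
        except OSError:
            # The task may have exited, or we lack permission; skip it.
            continue
        updated.append(pid)
    return updated
```

This has to be applied to each task individually, and tasks migrated into
the cgroup afterwards are not covered, whereas a single per-memcg priority
would express the same intent in one place.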