Re: [RESEND v12 0/6] cgroup-aware OOM killer

From: David Rientjes
Date: Sun Oct 22 2017 - 20:25:04 EST


On Thu, 19 Oct 2017, Johannes Weiner wrote:

> David would have really liked for this patchset to include knobs to
> influence how the algorithm picks cgroup victims. The rest of us
> agreed that this is beyond the scope of these patches, that the
> patches don't need it to be useful, and that there is nothing
> preventing anyone from adding configurability later on. David
> subsequently nacked the series as he considers it incomplete. Neither
> Michal nor I see technical merit in David's nack.
>

The nack is for three reasons:

(1) unfair comparison of root mem cgroup usage to bias against that mem
cgroup from oom kill in system oom conditions,

(2) the ability of users to completely evade the oom killer by attaching
all processes to child cgroups either purposefully or unpurposefully,
and

(3) the inability of userspace to effectively control oom victim
selection.

For (1), the difference in v12 is adding the rss of all processes attached
to the root mem cgroup as the evaluation. This is not the same criteria
that child cgroups are evaluated on, and they are compared using that
bias. It is very trivial to provide a fair comparison as was suggested in
v11.

For (2), users who do

for i in $(cat cgroup.procs); do mkdir $i; echo $i > $i/cgroup.procs; done

can completely evade the oom killer and this may be part of their standard
operating procedure for restricting resources with other cgroups other
than the mem cgroup. Again, it's an unfair comparison to all other
cgroups on the system.

For (3), users need the ability to protect important cgroups, such as
protecting a cgroup that is limited to 50% of system memory. They need
the ability to kill processes from other cgroups to protect these
important processes. This is nothing new: the oom killer has always
provided the ability to bias against important processes.

There was follow-up email on all of these points where very trivial
changes were suggested to address all three of these issues, and which
Roman has implemented in one form or another in previous iterations with
the bonus that no accounting to the root mem cgroup needs to be done.

> Michal acked the implementation, but on the condition that the new
> behavior be opt-in, to not surprise existing users. I *think* we agree
> that respecting the cgroup topography during global OOM is what we
> should have been doing when cgroups were initially introduced; where
> we disagree is that I think users shouldn't have to opt in to
> improvements. We have done much more invasive changes to the victim
> selection without actual regressions in the past. Further, this change
> only applies to mounts of the new cgroup2. Tejun also wasn't convinced
> of the risk for regression, and too would prefer cgroup-awareness to
> be the default in cgroup2. I would ask for patch 5/6 to be dropped.
>

I agree with Michal that the new victim selection should be opt-in with
CGRP_GROUP_OOM.