Re: [v8 0/4] cgroup-aware OOM killer

From: Roman Gushchin
Date: Tue Sep 26 2017 - 07:00:32 EST


On Mon, Sep 25, 2017 at 10:25:21PM +0200, Michal Hocko wrote:
> On Mon 25-09-17 19:15:33, Roman Gushchin wrote:
> [...]
> > I'm not against this model, as I've said before. It feels logical,
> > and will work fine in most cases.
> >
> > In this case we can drop any mount/boot options, because it preserves
> > the existing behavior in the default configuration. A big advantage.
>
> I am not sure about this. We still need an opt-in, ragardless, because
> selecting the largest process from the largest memcg != selecting the
> largest task (just consider memcgs with many processes example).

As I understand Johannes, he suggested to compare individual processes with
group_oom mem cgroups. In other words, always select a killable entity with
the biggest memory footprint.

This is slightly different from my v8 approach, where I treat leaf memcgs
as indivisible memory consumers independent on group_oom setting, so
by default I'm selecting the biggest task in the biggest memcg.

While the approach suggested by Johannes looks clear and reasonable,
I'm slightly concerned about possible implementation issues,
which I've described below:

>
> > The only thing, I'm slightly concerned, that due to the way how we calculate
> > the memory footprint for tasks and memory cgroups, we will have a number
> > of weird edge cases. For instance, when putting a single process into
> > the group_oom memcg will alter the oom_score significantly and result
> > in significantly different chances to be killed. An obvious example will
> > be a task with oom_score_adj set to any non-extreme (other than 0 and -1000)
> > value, but it can also happen in case of constrained alloc, for instance.
>
> I am not sure I understand. Are you talking about root memcg comparing
> to other memcgs?

Not only, but root memcg in this case will be another complication. We can
also use the same trick for all memcg (define memcg oom_score as maximum oom_score
of the belonging tasks), it will turn group_oom into pure container cleanup
solution, without changing victim selection algorithm

But, again, I'm not against approach suggested by Johannes. I think that overall
it's the best possible semantics, if we're not taking some implementation details
into account.