Re: [v7 5/5] mm, oom: cgroup v2 mount option to disable cgroup-aware OOM killer

From: David Rientjes
Date: Sat Sep 09 2017 - 04:46:16 EST


On Fri, 8 Sep 2017, Christopher Lameter wrote:

> Ok. Certainly there were scalability issues (lots of them) and the sysctl
> may have helped there if set globally. But the ability to kill the
> allocating tasks was primarily used in cpusets for constrained allocation.
>

I remember discussing it with him and he had some data with pretty extreme
numbers for how long the tasklist iteration was taking. Regardless, I
agree it's not pertinent to the discussion of whether anybody is actively
using the sysctl; it's just fun to try to remember the discussions from 10
years ago.

The problem I'm having with the removal, though, is that the kernel source
actually uses it itself in tools/testing/fault-injection/failcmd.sh.
That, to me, suggests there are people outside the kernel source who are
probably also using it. We use it as part of our unit testing, although
we could convert away from it.

These are things that can probably be worked around, but I'm struggling to
see what removing it actually buys us. The sysctl amounts to a definition,
the generic sysctl handling, and a single conditional in the oom killer.
I wouldn't risk the potential userspace breakage for that.
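
For reference, the entire oom killer side of it is roughly the following
check early in out_of_memory() (paraphrasing from memory, so the exact
details may differ from current oom_kill.c):

	if (sysctl_oom_kill_allocating_task && current->mm &&
	    !oom_unkillable_task(current, NULL, oc->nodemask) &&
	    current->signal->oom_score_adj != OOM_SCORE_ADJ_MIN) {
		/* skip the tasklist scan and kill the caller instead */
		get_task_struct(current);
		oc->chosen = current;
		oom_kill_process(oc, "Out of memory (oom_kill_allocating_task)");
		return true;
	}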

> The issue of scaling is irrelevant in the context of deciding what to do
> about the sysctl. You can address the issue differently if it still
> exists. The systems with super high NUMA nodes (hundreds to a
> thousand) have somehow fallen out of fashion a bit. So I doubt that this
> is still an issue. And no one of the old stakeholders is speaking up.
>
> What is the current approach for an OOM occurring in a cpuset or cgroup
> with a restricted numa node set?
>

It's always been shaky: we simply exclude potential kill victims based on
whether or not they share mempolicy nodes or cpuset mems with the
allocating process. Of course, this could result in no memory freeing
because a potential victim being allowed to allocate on a particular node
right now doesn't mean killing it will free memory on that node. It's
just more probable in practice. Nobody has complained about that
methodology, but we do have internal code that simply kills current for
mempolicy ooms. That is because we have priority-based oom killing much
like this patchset implements, except extended even further down to
individual processes.
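
To be concrete, the exclusion boils down to checking whether any thread of
a candidate intersects the allocator's mempolicy nodemask, or its cpuset
mems_allowed when there is no nodemask. Roughly (again paraphrased from
memory, details may differ from current oom_kill.c):

/*
 * A candidate is only eligible if at least one of its threads can
 * allocate from the same nodes as the task that hit the oom condition.
 */
static bool has_intersects_mems_allowed(struct task_struct *start,
					const nodemask_t *mask)
{
	struct task_struct *tsk;
	bool ret = false;

	rcu_read_lock();
	for_each_thread(start, tsk) {
		if (mask)
			/* mempolicy-constrained oom */
			ret = mempolicy_nodemask_intersects(tsk, mask);
		else
			/* cpuset-constrained oom */
			ret = cpuset_mems_allowed_intersects(current, tsk);
		if (ret)
			break;
	}
	rcu_read_unlock();

	return ret;
}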