Re: user defined OOM policies

From: David Rientjes
Date: Wed Nov 20 2013 - 02:50:48 EST


On Tue, 19 Nov 2013, Michal Hocko wrote:

> Hi,
> it's been quite some time since LSFMM 2013 when this has been
> discussed[1]. In short, it seems that there are usecases with a
> strong demand on a better user/admin policy control for the global
> OOM situations. Per process oom_{adj,score} which is used for the
> prioritizing is no longer sufficient because there are other categories
> which might be important. For example, often it doesn't make sense to
> kill just a part of the workload and killing the whole group would be a
> better fit. I am pretty sure there are many others some of them workload
> specific and thus not appropriate for the generic implementation.
>

Thanks for starting this thread. We'd like to have two things:

- allow userspace to call into our implementation of malloc() to free
excess memory so that nothing has to be killed, which may include
freeing userspace caches back to the kernel or using MADV_DONTNEED
over a range of unused memory within the arena, and

- enforce a hierarchical memcg prioritization policy so that memcgs can
be iterated at each level beneath the oom memcg (which may include the
root memcg for system oom conditions) and eligible processes are killed
in the lowest priority memcg.
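
To make the second item concrete, here is a minimal userspace sketch of
the selection step only. It is not kernel code and not the patch being
discussed: struct memcg_node, its priority field, and the rule "lower
value is killed first" are all assumptions made for illustration, as is
the choice to descend into children before considering processes
attached to the parent.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical userspace model of a memcg; not a kernel structure. */
struct memcg_node {
    const char *name;
    int priority;        /* assumption: lower value is killed first */
    int nr_eligible;     /* eligible processes attached directly */
    const struct memcg_node *const *children;
    size_t nr_children;
};

/* Return 1 if this memcg or any descendant has an eligible process. */
static int subtree_has_eligible(const struct memcg_node *cg)
{
    if (cg->nr_eligible > 0)
        return 1;
    for (size_t i = 0; i < cg->nr_children; i++)
        if (subtree_has_eligible(cg->children[i]))
            return 1;
    return 0;
}

/*
 * Starting at the oom memcg, pick the lowest priority child at each
 * level (skipping subtrees with nothing to kill) and descend until no
 * child qualifies. Returns the memcg to kill in, or NULL if none.
 */
static const struct memcg_node *
pick_victim_memcg(const struct memcg_node *oom_cg)
{
    const struct memcg_node *cur = oom_cg;

    assert(oom_cg != NULL);
    for (;;) {
        const struct memcg_node *lowest = NULL;

        for (size_t i = 0; i < cur->nr_children; i++) {
            const struct memcg_node *child = cur->children[i];

            if (!subtree_has_eligible(child))
                continue;
            if (!lowest || child->priority < lowest->priority)
                lowest = child;
        }
        if (!lowest)
            break;
        cur = lowest;
    }
    return cur->nr_eligible > 0 ? cur : NULL;
}
```

For a system oom the caller would pass the root memcg as oom_cg; for a
memcg oom it would pass the memcg that hit its limit.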

This obviously allows for much more powerful implementations as well,
defined by users of memcgs, that can drop caches, increase memcg limits,
signal applications to free unused memory, start throttling memory
usage, do heap analysis, logging, etc. Userspace oom handlers are the
perfect place to do so.
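
As a rough illustration of the "free unused memory" action, the sketch
below hands the whole pages inside an unused range back to the kernel
with madvise(MADV_DONTNEED). The function name release_unused_range()
is hypothetical and this is not our malloc() implementation; it only
shows the page-alignment handling such a hook would need, since
madvise() requires a page-aligned start address.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Release every whole page that lies inside [start, start + len).
 * Partial pages at either end are left untouched because they may
 * still hold live data. Returns the number of bytes released, 0 if
 * the range contains no whole page, or -1 if madvise() fails.
 */
static long release_unused_range(void *start, size_t len)
{
    uintptr_t page = (uintptr_t)sysconf(_SC_PAGESIZE);
    uintptr_t lo, hi;

    assert((page & (page - 1)) == 0);   /* page size is a power of two */

    lo = ((uintptr_t)start + page - 1) & ~(page - 1);  /* round up */
    hi = ((uintptr_t)start + len) & ~(page - 1);       /* round down */
    if (hi <= lo)
        return 0;

    if (madvise((void *)lo, hi - lo, MADV_DONTNEED) != 0)
        return -1;
    return (long)(hi - lo);
}
```

For private anonymous mappings on Linux, the released pages read back
as zero-filled on the next access, so the caller must be sure that
nothing in the range is still in use.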

> We have basically ended up with 3 options AFAIR:
> 1) allow memcg approach (memcg.oom_control) on the root level
> for both OOM notification and blocking OOM killer and handle
> the situation from the userspace same as we can for other
> memcgs.

This is what I've been proposing with my latest patches, with the
memory.oom_delay_millisecs patch in the past, and with a future patch
that adds per-memcg memory reserves. Those reserves allow charging to
bypass the limit up to a pre-defined threshold, much like per-zone
memory reserves for TIF_MEMDIE processes today, so that userspace has
access to memory to handle the situation even in system oom conditions.
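
For reference, this is roughly how a userspace handler registers for
oom notification on a memcg with the existing cgroup v1 interface: it
creates an eventfd, opens memory.oom_control, and writes
"<event_fd> <oom_control_fd>" to cgroup.event_control. The helper names
below are hypothetical and error handling is kept minimal; treat it as
a sketch of the registration step, not a complete handler.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/eventfd.h>
#include <unistd.h>

/* Build the "<event_fd> <target_fd>" line for cgroup.event_control. */
static int format_event_control(char *buf, size_t len,
                                int event_fd, int target_fd)
{
    int n = snprintf(buf, len, "%d %d", event_fd, target_fd);

    return (n > 0 && (size_t)n < len) ? n : -1;
}

/*
 * Register for oom notifications on the memcg mounted at cgroup_dir.
 * On success returns the eventfd and stores the still-open
 * memory.oom_control descriptor in *oom_control_fd. Returns -1 on error.
 */
static int memcg_oom_register(const char *cgroup_dir, int *oom_control_fd)
{
    char path[4096], line[64];
    int efd = -1, ofd = -1, cfd = -1, n;

    assert(cgroup_dir != NULL && oom_control_fd != NULL);

    n = snprintf(path, sizeof(path), "%s/memory.oom_control", cgroup_dir);
    if (n < 0 || (size_t)n >= sizeof(path))
        goto fail;
    ofd = open(path, O_RDONLY);
    if (ofd < 0)
        goto fail;

    n = snprintf(path, sizeof(path), "%s/cgroup.event_control", cgroup_dir);
    if (n < 0 || (size_t)n >= sizeof(path))
        goto fail;
    cfd = open(path, O_WRONLY);
    if (cfd < 0)
        goto fail;

    efd = eventfd(0, EFD_CLOEXEC);
    if (efd < 0)
        goto fail;

    n = format_event_control(line, sizeof(line), efd, ofd);
    if (n < 0 || write(cfd, line, (size_t)n) != n)
        goto fail;

    close(cfd);
    *oom_control_fd = ofd;
    return efd;

fail:
    if (efd >= 0)
        close(efd);
    if (cfd >= 0)
        close(cfd);
    if (ofd >= 0)
        close(ofd);
    return -1;
}
```

The handler then blocks in an 8-byte read() on the returned eventfd
until the memcg goes oom. Writing "1" to memory.oom_control disables
the kernel oom killer for that memcg, which is what makes it necessary
for the handler itself to have memory available when it wakes up.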