Re: [patch 00/11] userspace out of memory handling

From: Andrew Morton
Date: Wed Mar 05 2014 - 16:17:55 EST


On Tue, 4 Mar 2014 19:58:38 -0800 (PST) David Rientjes <rientjes@xxxxxxxxxx> wrote:

> This patchset implements userspace out of memory handling.
>
> It is based on v3.14-rc5. Individual patches will apply cleanly or you
> may pull the entire series from
>
> git://git.kernel.org/pub/scm/linux/kernel/git/rientjes/linux.git mm/oom
>
> When the system or a memcg is oom, processes running on that system or
> attached to that memcg cannot allocate memory. It is impossible for a
> process to reliably handle the oom condition from userspace.
>
> First, consider only system oom conditions. When memory is completely
> depleted and nothing may be reclaimed, the kernel is forced to free some
> memory; the only way it can do so is to kill a userspace process. This
> will happen instantaneously and userspace can enforce neither its own
> policy nor collect information.
>
> On system oom, there may be a hierarchy of memcgs that represent user
> jobs, for example. Each job may have a priority independent of its
> current memory usage. There is no existing kernel interface to kill the
> lowest priority job; userspace can now kill the lowest priority job or
> allow priorities to change based on whether the job is using more memory
> than its pre-defined reservation.
>
> Additionally, users may want to log the condition or debug applications
> that are using too much memory. They may wish to collect heap profiles,
> or they may be able to free memory without killing a process by
> throttling or ratelimiting.
>
> Interactive users using X window environments may wish to have a dialogue
> box appear to determine how to proceed -- it may even allow them shell
> access to examine the state of the system while oom.
>
> It's not sufficient to simply restrict all user processes to a subset of
> memory and oom handling processes to the remainder via a memcg hierarchy:
> kernel memory and other page allocations can easily deplete all memory
> that is not charged to a user hierarchy of memory.
>
> This patchset allows userspace to do all of these things by defining a
> small memory reserve that is accessible only by processes that are
> handling the notification.
>
> Second, consider memcg oom conditions. Processes need no special
> knowledge of whether they are attached to the root memcg, where memcg
> charging will always succeed, or a child memcg where charging will fail
> when the limit has been reached. This allows those processes handling
> memcg oom conditions to overcharge the memcg by the amount of reserved
> memory. They need not create child memcgs with smaller limits and
> attach the userspace oom handler only to the parent; such support would
> not allow userspace to handle system oom conditions anyway.
>
> This patchset introduces a standard interface through memcg that allows
> both of these conditions to be handled in the same clean way: users
> set memory.oom_reserve_in_bytes to define the reserve, and the memcg of
> the process handling the oom condition may be overcharged by this
> amount. If used with the root memcg, this amount is allowed
> to be allocated below the per-zone watermarks for root processes that
> are handling such conditions (only root may write to
> cgroup.event_control for the root memcg).
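
For reference, the existing cgroup v1 notification mechanism registers an
eventfd by writing "<event_fd> <fd of memory.oom_control>" to
cgroup.event_control. Below is a minimal sketch of how a handler might set
a reserve and register for oom notification. It assumes Python 3.10+ on
Linux; the helper names and the memcg directory argument are hypothetical,
and memory.oom_reserve_in_bytes exists only with this patchset applied.

```python
import os


def event_control_line(event_fd: int, control_fd: int) -> str:
    """Format the registration string written to cgroup.event_control."""
    return f"{event_fd} {control_fd}"


def register_oom_handler(memcg_dir: str, reserve_bytes: int) -> tuple[int, int]:
    """Set the oom reserve and register an eventfd for oom notification.

    Returns (event_fd, control_fd); the caller owns and closes both.
    """
    # Assumption: this file is the one proposed by the patchset and is
    # absent on unpatched kernels.
    with open(os.path.join(memcg_dir, "memory.oom_reserve_in_bytes"), "w") as f:
        f.write(str(reserve_bytes))

    event_fd = os.eventfd(0)  # Linux only, Python 3.10+
    control_fd = os.open(os.path.join(memcg_dir, "memory.oom_control"), os.O_RDONLY)

    # cgroup v1 registration: "<event_fd> <fd of memory.oom_control>"
    with open(os.path.join(memcg_dir, "cgroup.event_control"), "w") as f:
        f.write(event_control_line(event_fd, control_fd))

    return event_fd, control_fd
```

A handler would then block in os.eventfd_read(event_fd) until the kernel
signals an oom condition for that memcg.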

If process A is trying to allocate memory, cannot do so, and the
userspace oom-killer is invoked, there must be a means by which process
A waits for the userspace oom-killer's action. And there must be
fallbacks which take effect if the userspace oom-killer fails to clear
the oom condition, or times out.
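
To make the requirement concrete, here is a minimal userspace model of the
wait-then-fall-back behaviour. It is only an illustration under stated
assumptions, not how the kernel or this patchset implements it: the
function name, the threading.Event standing in for "the handler cleared
the oom condition", and the fallback callback are all hypothetical.

```python
import threading
from typing import Callable


def wait_for_oom_handler(
    resolved: threading.Event,
    timeout_s: float,
    fallback: Callable[[], None],
) -> str:
    """Model of a stalled allocator waiting on a userspace oom handler.

    Returns "handled" if the handler signalled progress within timeout_s,
    otherwise runs the fallback and returns "fallback".
    """
    if resolved.wait(timeout_s):
        # Handler cleared the condition; the allocation can be retried.
        return "handled"
    # Handler failed or timed out; fall back (e.g. to the kernel oom killer).
    fallback()
    return "fallback"
```

In this model the fallback runs only when the timeout expires without the
handler signalling, which is the case that needs a defined behaviour.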

Would be interested to see a description of how all this works.


It is unfortunate that this feature is memcg-only. Surely it could
also be used by non-memcg setups. Would like to see at least a
detailed description of how this will all be presented and implemented.
We should aim to make the memcg and non-memcg userspace interfaces and
user-visible behaviour as similar as possible.

Patches 1, 2, 3 and 5 appear to be independent and useful so I think
I'll cherrypick those, OK?