Re: [patch] mm, memcg: add oom killer delay

From: David Rientjes
Date: Wed Jun 26 2013 - 19:18:20 EST

Next message: Andrew Morton: "Re: [PATCH] nbd: correct disconnect behavior"
Previous message: Pavel Machek: "Re: [PATCH v3] power: Add warning message about incorrect systemdate and time after resume when pm_trace is enabled"
In reply to: Kamezawa Hiroyuki: "Re: [patch] mm, memcg: add oom killer delay"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

On Tue, 25 Jun 2013, Kamezawa Hiroyuki wrote:

> Considering only memcg, bypassing all charge-limit checks will work.
> But as you say, that will not work against global-oom.

I think it will, since we have per-zone memory reserves that can be
bypassed in the page allocator, not to the level of PF_MEMALLOC or
TIF_MEMDIE but perhaps to the min watermark / 4, for example. A userspace
global oom handler will obviously be mlocked in memory, and this reserve is
used only for true kmem accounting, so that things like reading the memcg
tasks file, reading /proc/pid/stat, or dynamically allocating a buffer to
store data to iterate over all work. This is why
memory.oom_delay_millisecs is crucial: we want the same functionality that
the "user root" has for global oom conditions at the memcg root level, and
if reserves are exhausted, the kernel will kill something (which
should be rare, but possible) and use the rest of the memory reserves to
allow it to exit.
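
To make the "min watermark / 4" idea concrete, here is a toy model of the
check, not kernel code. The function name can_allocate() and the
oom_handler flag are made up for illustration; the real page allocator
watermark logic is more involved than this.

```python
def can_allocate(free_pages, request_pages, min_wmark, oom_handler=False):
    """Toy watermark check: return True if the allocation may proceed."""
    # Ordinary tasks must leave the full min watermark untouched.
    # The hypothetical oom-handler case may dip down to min_wmark // 4,
    # which still leaves a reserve for the kernel itself.
    floor = min_wmark // 4 if oom_handler else min_wmark
    return free_pages - request_pages >= floor
```

With free_pages=1000, request_pages=100 and min_wmark=1000, an ordinary
task is refused while the oom handler is allowed, but the handler is still
refused once it would go below a quarter of the min watermark.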

> > Even with all of the above (which is not actually that invasive of a
> > patch), I still think we need memory.oom_delay_millisecs. I probably made
> > a mistake in describing what that is addressing if it seems like it's
> > trying to address any of the above.
> >
> > If a userspace oom handler fails to respond even with access to those
> > "memcg reserves",
>
> How does this happen?
>

If the memcg reserves are exhausted, then the kernel needs to kill
something even in global oom conditions (consider a "user root" memcg tree
to be the same as a global oom condition for processes attached to that
tree) since otherwise the machine hangs. There's no guarantee that some
root process sitting in the root memcg would be able to enforce this delay
as Michal suggests, since reserves could be depleted. It's important that
we don't do something as extreme as PF_MEMALLOC, which would allow all
per-zone reserves to be depleted, so that the kernel can still intervene
and kill something when userspace is unresponsive.

> Someone may be against that kind of control and say "Hey, I have better idea".
> That was another reason that oom-scripting was discussed. No one can
> implement general-purpose-victim-selection-logic.
>

Completely agreed, and our experience shows that users who manipulate
their own "user root" memcgs have their own logic; this is why we're
trying to make userspace oom handling as powerful as possible without
risking making the machine unresponsive.

> IMHO, it will be difficult, but allowing users to write a script/filter for
> oom-killing will be worth trying, like:
>
> ==
> for_each_process :
>     if comm == mem_manage_daemon :
>         continue
>     if user == root :
>         continue
>     score = default_calc_score()
>     if score > high_score :
>         selected = current
> ==
>
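
For reference, the quoted filter could be written as runnable Python along
the following lines. This is only a sketch: the Task fields and the
select_victim() name are assumptions, the score field stands in for
default_calc_score(), and the high_score update that the pseudocode leaves
implicit is spelled out.

```python
from dataclasses import dataclass


@dataclass
class Task:
    pid: int
    comm: str
    user: str
    score: int  # stand-in for default_calc_score()


def select_victim(tasks, protected_comm="mem_manage_daemon"):
    """Return the highest-scoring task that is not exempt, or None."""
    selected = None
    high_score = -1
    for task in tasks:
        if task.comm == protected_comm:
            continue  # never pick the memory management daemon
        if task.user == "root":
            continue  # never pick root-owned tasks
        if task.score > high_score:
            high_score = task.score  # remember the best score so far
            selected = task
    return selected
```

If every task is exempt, select_victim() returns None and the caller has to
decide what to do next.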

This is effectively what already happens with the oom delay as proposed
here: the userspace oom handler is given access to "memcg reserves" and a
period of time to respond; if that fails, then the kernel will kill
something the next time we try to charge to the memcg.
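
The sequence described above can be modelled roughly as follows. This is a
simplified simulation under stated assumptions, not the patch itself:
on_charge_failure(), its arguments and the returned action strings are all
hypothetical names, and delay_ms plays the role of
memory.oom_delay_millisecs.

```python
def on_charge_failure(now_ms, oom_since_ms, delay_ms, handler_freed_memory):
    """Decide what happens when a charge to the memcg fails.

    Returns (action, oom_since_ms), where oom_since_ms is the time the
    memcg first went oom, or None if no oom period is being tracked.
    """
    if handler_freed_memory:
        # Userspace responded in time: retry the charge, reset the clock.
        return "retry", None
    if oom_since_ms is None:
        # First failed charge: the delay period starts now.
        oom_since_ms = now_ms
    if now_ms - oom_since_ms >= delay_ms:
        # Userspace did not respond within the delay: the kernel kills.
        return "kernel_kill", None
    # Still inside the delay: give the userspace handler more time.
    return "wait", oom_since_ms
```

With delay_ms=0 the first failed charge goes straight to "kernel_kill",
which in this model corresponds to having no delay at all.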

> BTW, if you love the logic in the userland oom daemon, why can't you implement
> it in the kernel? Does it do some pretty things other than sending SIGKILL?
>

Some do "pretty" things like collect stats and dump them before killing
something, but we also want oom handlers that don't do SIGKILL at all. An
example: we statically allocate hugepages at boot because we need a large
percentage of memory to be backed by hugepages for a certain class of
applications and it's only available at boot. We also have a userspace
that runs on these machines that is shared between hugepage machines and
non-hugepage machines. At times, this userspace becomes oom because the
remainder of available memory is allocated as hugetlb pages when in
reality they are unmapped and sitting in a free pool. In that case, our
userspace oom handler wants to free those hugetlb pages back to the kernel
down to a certain watermark and then opportunistically reallocate them to
the pool once memory usage on the system drops again after spikes in the
userspace.
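
A handler of that kind might look something like the sketch below. It
relies on the HugePages_Total and HugePages_Free fields of /proc/meminfo
and on writing a smaller value to /proc/sys/vm/nr_hugepages; the function
names and the watermark parameter are assumptions, and reserved pages
(HugePages_Rsvd) are ignored for brevity.

```python
def parse_hugepages(meminfo_text):
    """Extract (HugePages_Total, HugePages_Free) from /proc/meminfo text."""
    fields = {}
    for line in meminfo_text.splitlines():
        key, _, value = line.partition(":")
        if key in ("HugePages_Total", "HugePages_Free"):
            fields[key] = int(value.split()[0])
    return fields["HugePages_Total"], fields["HugePages_Free"]


def shrink_target(total, free, watermark):
    """Pool size to request: never below the watermark or the pages in use."""
    in_use = total - free
    return min(total, max(watermark, in_use))


def shrink_pool(watermark, meminfo_path="/proc/meminfo",
                nr_path="/proc/sys/vm/nr_hugepages"):
    """Release free hugetlb pages back to the kernel, down to watermark."""
    with open(meminfo_path) as f:
        total, free = parse_hugepages(f.read())
    target = shrink_target(total, free, watermark)
    if target < total:
        with open(nr_path, "w") as f:  # needs root
            f.write(str(target))
    return target
```

Growing the pool again later would be the same write with a larger value,
with no guarantee that the kernel can find enough contiguous memory.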