Re: [RFC] memory cgroup: my thoughts on memsw

From: Kamezawa Hiroyuki
Date: Mon Sep 15 2014 - 21:36:28 EST

Next message: Wang, Yalin: "[RFC v3] arm:extend the reserved mrmory for initrd to be page aligned"
Previous message: Chen Gang: "Re: [PATCH 4/4] drivers/xen/xenbus/xenbus_client.c: Improve the failure processing for __xenbus_switch_state()"
In reply to: Johannes Weiner: "Re: [RFC] memory cgroup: my thoughts on memsw"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

(2014/09/16 4:14), Johannes Weiner wrote:

Hi Vladimir,

On Thu, Sep 04, 2014 at 06:30:55PM +0400, Vladimir Davydov wrote:

To sum it up, the current mem + memsw configuration scheme doesn't allow
us to limit swap usage if we want to partition the system dynamically
using soft limits. Actually, it also looks rather confusing to me. We
have a mem limit and a mem+swap limit. I bet that at first glance, an
average admin will think it's possible to limit swap usage by setting
the limits so that the difference between memory.memsw.limit and
memory.limit equals the maximal swap usage, but (surprise!) it isn't
really so. It holds if there's no global memory pressure, but otherwise
swap usage is only limited by memory.memsw.limit! IMHO, it isn't
something obvious.
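
The asymmetry described above can be illustrated with a toy model. This is only a sketch: the function name `max_swap_usage` and the GiB figures are made up for illustration, and real reclaim is less tidy than a two-branch function.

```python
def max_swap_usage(mem_limit, memsw_limit, global_pressure):
    """Toy upper bound on a group's swap usage under the mem + memsw scheme.

    All values share one unit (say GiB). Illustrative model, not kernel code.
    """
    if global_pressure:
        # Global reclaim can swap out the group's pages even while the group
        # is below its own memory limit, so only memsw caps swap usage.
        return memsw_limit
    # Without global pressure the group swaps only after hitting its own
    # memory limit, so swap stays within the difference between the limits.
    return max(memsw_limit - mem_limit, 0)


# Hypothetical group: 4 GiB memory limit, 6 GiB memory+swap limit.
print(max_swap_usage(4, 6, global_pressure=False))  # 2
print(max_swap_usage(4, 6, global_pressure=True))   # 6
```

So the "expected" 2 GiB swap cap only holds while the machine as a whole is not under memory pressure.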

Agreed, memory+swap accounting & limiting is broken.

- Anon memory is handled by the user application, while file caches are
managed entirely by the kernel. That means the application will *definitely* die
w/o anon memory. W/o file caches it usually can survive, but the more
caches it has the better it feels.

- Anon memory is not that easy to reclaim. Swap out is a really slow
process, because data are usually read/written w/o any specific
order. Dropping file caches is much easier. Typically we have lots of
clean pages there.

- Swap space is limited. And today, it's OK to have TBs of RAM and only
several GBs of swap. Customers simply don't want to waste their disk
space on that.

Finally, my understanding (may be crazy!) of how things should be
configured. Just like now, there should be mem_cgroup->res accounting
and limiting total user memory (cache+anon) usage for processes inside
cgroups. Nothing needs to change there. However, mem_cgroup->memsw
should be reworked to account *only* memory that may be swapped out plus
memory that has been swapped out (i.e. swap usage).

But anon pages are not a resource, they are a swap space liability.
Think of virtual memory vs. physical pages - the use of one does not
necessarily result in the use of the other. Without memory pressure,
anonymous pages do not consume swap space.

What we *should* be accounting and limiting here is the actual finite
resource: swap space. Whenever we try to swap a page, its owner
should be charged for the swap space - or the swapout be rejected.

For hard limit reclaim, the semantics of a swap space limit would be
fairly obvious, because it's clear who the offender is.

However, in an overcommitted machine, the amount of swap space used by
a particular group depends just as much on the behavior of the other
groups in the system, so the per-group swap limit should be enforced
even during global reclaim to feed back pressure on whoever is causing
the swapout. If reclaim fails, the global OOM killer triggers, which
should then kill the group with the biggest soft limit excess.
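
As a rough illustration of "biggest soft limit excess", victim selection could look like the sketch below. The `Group` type, its fields, and `pick_oom_victim` are hypothetical names for this example only; nothing here is existing kernel code.

```python
from dataclasses import dataclass


@dataclass
class Group:
    name: str
    usage: int       # current memory usage, arbitrary units
    soft_limit: int  # configured soft limit, same units


def pick_oom_victim(groups):
    """Return the group with the largest soft limit excess, or None."""
    # Only groups above their soft limit are candidates.
    over = [g for g in groups if g.usage > g.soft_limit]
    if not over:
        return None
    return max(over, key=lambda g: g.usage - g.soft_limit)


groups = [Group("a", 10, 8), Group("b", 30, 20), Group("c", 5, 6)]
victim = pick_oom_victim(groups)
print(victim.name)  # b
```

Group "b" exceeds its soft limit by 10 units, more than "a" (2 units), while "c" is below its soft limit and is never considered.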

As far as implementation goes, it should be doable to try-charge from
add_to_swap() and keep the uncharging in swap_entry_free().
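
The charge-or-reject idea can be sketched in user space as follows. This is an assumption-laden model: `SwapCounter` is a hypothetical per-group counter, and the two Python functions only borrow the names add_to_swap() and swap_entry_free() from the text above; they do not reflect the real kernel signatures.

```python
class SwapCounter:
    """Hypothetical per-group swap space counter (units: swap slots)."""

    def __init__(self, limit):
        self.limit = limit
        self.usage = 0

    def try_charge(self, nr_pages=1):
        # Refuse the charge if it would push usage over the group's limit.
        if self.usage + nr_pages > self.limit:
            return False
        self.usage += nr_pages
        return True

    def uncharge(self, nr_pages=1):
        # Never let usage go negative in this toy model.
        self.usage = max(self.usage - nr_pages, 0)


def add_to_swap(counter):
    """Model of swapout: succeeds only if swap space can be charged."""
    return counter.try_charge()


def swap_entry_free(counter):
    """Model of freeing a swap entry: returns the slot to the group."""
    counter.uncharge()


counter = SwapCounter(limit=2)
print([add_to_swap(counter) for _ in range(3)])  # [True, True, False]
swap_entry_free(counter)
print(add_to_swap(counter))  # True
```

The third swapout is rejected because the group is at its swap limit; once one entry is freed, the next swapout succeeds again.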

We'll also have to extend the global OOM killer to be memcg-aware, but
we've been meaning to do that anyway.

When we introduced the memsw limit, we tried to avoid affecting global memory reclaim.
That is why we implemented a memory+swap limit.

Now, global memory reclaim is memcg-aware. So, I think limiting swap alone, rather than
anon+swap, may be an option. The change will reduce res_counter accesses. Hmm, it will be
desirable to move anon pages to Unevictable if the memcg's swap slot count is 0.

Anyway, I think the soft limit should be re-implemented first. That will be the starting point.

Thanks,
-Kame

