[PATCH] memcg: make oom_lock 0 and 1 based rather than counter

From: Michal Hocko
Date: Wed Jul 13 2011 - 07:05:49 EST


Commit 867578cb ("memcg: fix oom kill behavior") introduced the oom_lock
counter, which is incremented by mem_cgroup_oom_lock when we are about to
handle a memcg OOM situation. mem_cgroup_handle_oom falls back to sleeping
if oom_lock > 1, to prevent multiple OOM kills from running at the same
time. The counter is then decremented by mem_cgroup_oom_unlock, called
from the same function.
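
For reference, the counter-based locking that is being removed boils
down to the following (a reconstruction from the lines this patch
deletes; struct mem_cgroup, for_each_mem_cgroup_tree and the atomic
API are as in mm/memcontrol.c):

static bool mem_cgroup_oom_lock(struct mem_cgroup *mem)
{
	int x, lock_count = 0;
	struct mem_cgroup *iter;

	/* unconditionally bump the counter on every group in the subtree */
	for_each_mem_cgroup_tree(iter, mem) {
		x = atomic_inc_return(&iter->oom_lock);
		lock_count = max(x, lock_count);
	}

	/* only a caller that saw every counter go 0 -> 1 may kill */
	return lock_count == 1;
}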

This works correctly, but it can lead to serious starvation when we have
many processes triggering OOM.

Consider a process (call it A) which takes the oom_lock (the first one
that gets to mem_cgroup_handle_oom and grabs memcg_oom_mutex). All other
processes block on that mutex.
While A has released the mutex and is in mem_cgroup_out_of_memory, the
others wake up (one after another), increment the counter, and go to
sleep (on memcg_oom_waitq). Once A finishes mem_cgroup_out_of_memory, it
takes the mutex again, decrements oom_lock, and wakes the other tasks
(if releasing the memory of the killed task hasn't done so already).
The main problem is that everybody still races for the mutex, and there
is no guarantee that the counter drops back to 0 for those that got back
to mem_cgroup_handle_oom. In the end the whole convoy keeps incrementing
and decrementing the counter, but nobody ever sees it at 1, which is
what would enable a kill, so nothing useful happens. The time wasted
this way is essentially unbounded, because it depends entirely on
scheduling and on the ordering of mutex acquisitions.
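
To make the convoy concrete, here is one possible interleaving with a
single group and three tasks (an illustrative schedule, not an exact
trace):

  A: inc -> oom_lock == 1, sees 1, drops the mutex, starts the kill
  B: inc -> oom_lock == 2, sees > 1, sleeps on memcg_oom_waitq
  C: inc -> oom_lock == 3, sees > 1, sleeps on memcg_oom_waitq
  A: dec -> oom_lock == 2, wakes B and C
  B: dec -> oom_lock == 1, retries the charge, inc -> 2, sleeps again
  C: dec -> oom_lock == 1, retries the charge, inc -> 2, sleeps again
  ...

With enough tasks the increments and decrements keep interleaving so
that nobody ever observes the counter at 1 again, and no further kill
is attempted.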

This patch replaces the counter with simple lock/unlock semantics: only
the values 0 and 1 are used, to distinguish the two states.
Because mem_cgroup_oom_{un}lock works on a subtree of a hierarchy, we
have to make sure that nobody else races with us; that is guaranteed by
memcg_oom_mutex. All other consumers just read the value atomically for
a single group, which is sufficient because we also set the value
atomically.
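
The single-group test-and-set relies on atomic_add_unless(&v, 1, 1),
i.e. "add 1 unless the value is already 1", which for a value that is
only ever 0 or 1 degenerates to a compare-and-swap. A standalone
user-space sketch of that semantic (oom_trylock is a made-up name and
C11 atomics stand in for the kernel's atomic_t):

#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

static atomic_int oom_lock;	/* 0 == unlocked, 1 == locked */

/* mimics !!atomic_add_unless(&oom_lock, 1, 1): true iff we took the lock */
static bool oom_trylock(atomic_int *v)
{
	int expected = 0;
	return atomic_compare_exchange_strong(v, &expected, 1);
}

int main(void)
{
	printf("first : %d\n", oom_trylock(&oom_lock));	/* 1: 0 -> 1 */
	printf("second: %d\n", oom_trylock(&oom_lock));	/* 0: already taken */
	atomic_store(&oom_lock, 0);			/* unlock */
	printf("third : %d\n", oom_trylock(&oom_lock));	/* 1: free again */
	return 0;
}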
mem_cgroup_oom_lock has to be really careful, because we might sit
higher in the hierarchy than an already OOM-locked subtree of the same
hierarchy:
      A
     / \
    B   \
   / \   \
  C   D   E

The B - C - D subtree might already be locked. While we want to allow
locking the E subtree, because the two OOM situations cannot influence
each other, we definitely do not want to allow locking A.
Therefore we have to refuse the lock if any part of the subtree is
already locked, and clear the lock on all nodes that we set up to the
failure point.
The unlock path is then very easy, because we always unlock exactly the
subtree we locked previously. A task that failed to take the lock must
not unlock anything at all, which is why mem_cgroup_handle_oom now calls
mem_cgroup_oom_unlock only when its locked flag is set.
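
For illustration, with B - C - D already locked, an attempt to lock the
A subtree walks like this (assuming the usual pre-order iteration of
for_each_mem_cgroup_tree_cond):

  A: 0 -> 1, x == 1, lock_count == 1       (we took A)
  B: already 1, add_unless fails, x == 0   -> failed = B, stop walking

  cleanup pass:
  A: atomic_set(&A->oom_lock, 0)           (undo the lock we took)
  B: iter == failed -> stop                (B was never ours to undo)

Locking the E subtree, on the other hand, visits only E, succeeds, and
never touches the already locked B - C - D part.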

Signed-off-by: Michal Hocko <mhocko@xxxxxxx>
---
mm/memcontrol.c | 48 +++++++++++++++++++++++++++++++++++++++---------
1 files changed, 39 insertions(+), 9 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index e013b8e..29f00d0 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1803,22 +1803,51 @@ static int mem_cgroup_hierarchical_reclaim(struct mem_cgroup *root_mem,
 /*
  * Check OOM-Killer is already running under our hierarchy.
  * If someone is running, return false.
+ * Has to be called with memcg_oom_mutex
  */
 static bool mem_cgroup_oom_lock(struct mem_cgroup *mem)
 {
-	int x, lock_count = 0;
-	struct mem_cgroup *iter;
+	int x, lock_count = -1;
+	struct mem_cgroup *iter, *failed = NULL;
+	bool cond = true;
 
-	for_each_mem_cgroup_tree(iter, mem) {
-		x = atomic_inc_return(&iter->oom_lock);
-		lock_count = max(x, lock_count);
+	for_each_mem_cgroup_tree_cond(iter, mem, cond) {
+		x = !!atomic_add_unless(&iter->oom_lock, 1, 1);
+		if (lock_count == -1)
+			lock_count = x;
+		else if (lock_count != x) {
+			/*
+			 * this subtree of our hierarchy is already locked
+			 * so we cannot give a lock.
+			 */
+			lock_count = 0;
+			failed = iter;
+			cond = false;
+		}
 	}
 
-	if (lock_count == 1)
-		return true;
-	return false;
+	if (!failed)
+		goto done;
+
+	/*
+	 * OK, we failed to lock the whole subtree so we have to clean up
+	 * what we set up to the failing subtree
+	 */
+	cond = true;
+	for_each_mem_cgroup_tree_cond(iter, mem, cond) {
+		if (iter == failed) {
+			cond = false;
+			continue;
+		}
+		atomic_set(&iter->oom_lock, 0);
+	}
+done:
+	return lock_count;
 }
 
+/*
+ * Has to be called with memcg_oom_mutex
+ */
 static int mem_cgroup_oom_unlock(struct mem_cgroup *mem)
 {
 	struct mem_cgroup *iter;
@@ -1916,7 +1945,8 @@ bool mem_cgroup_handle_oom(struct mem_cgroup *mem, gfp_t mask)
 		finish_wait(&memcg_oom_waitq, &owait.wait);
 	}
 	mutex_lock(&memcg_oom_mutex);
-	mem_cgroup_oom_unlock(mem);
+	if (locked)
+		mem_cgroup_oom_unlock(mem);
 	memcg_wakeup_oom(mem);
 	mutex_unlock(&memcg_oom_mutex);
 
--
1.7.5.4


--
Michal Hocko
SUSE Labs
SUSE LINUX s.r.o.
Lihovarska 1060/12
190 00 Praha 9
Czech Republic