Re: memcg creates an unkillable task in 3.11-rc2

From: Eric W. Biederman
Date: Tue Jul 30 2013 - 04:20:43 EST

Next message: Jingoo Han: "[PATCH 34/35] regulator: use dev_get_platdata()"
Previous message: Peter Zijlstra: "Re: [RFC PATCH 00/10] Improve numa scheduling by consolidating tasks"
In reply to: Li Zefan: "Re: memcg creates an unkillable task in 3.2-rc2"
Next in thread: Michal Hocko: "Re: memcg creates an unkillable task in 3.11-rc2"

Li Zefan <lizefan@xxxxxxxxxx> writes:

>> I am also seeing what looks like a leak somewhere in the cgroup code as
>> well. After some runs of the same reproducer I get into a state where,
>> after everything is cleaned up (all of the control groups have been
>> removed and the cgroup filesystem is unmounted), I can mount a cgroup
>> filesystem with that same combination of subsystems, but I can't mount
>> a cgroup filesystem with any of those subsystems in any other
>> combination. So I am guessing that the superblock from the original
>> mount is still lingering for some reason.
>>
>
> If this happens again, you can check /proc/cgroups,
>
> #subsys_name hierarchy num_cgroups enabled
> cpuset 0 1 1
> debug 0 1 1
> cpu 0 1 1
> cpuacct 0 1 1
> memory 0 1 1
> devices 0 1 1
> freezer 0 1 1
> blkio 0 1 1
>
> If "hierarchy" is not 0, then it wasn't really unmounted. If "num_cgroups"
> is not 1, then there are some cgroups that were not really destroyed even
> though they've been rmdir'ed.
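
The check described above can be scripted. The following is a minimal sketch, assuming the four-column /proc/cgroups layout shown in the listing and assuming every hierarchy has already been unmounted (on a system with cgroups mounted, non-zero hierarchy IDs are expected). The names cg_entry, parse_cgroup_line, is_lingering and report_lingering are hypothetical helpers introduced for illustration only.

```c
#include <stdio.h>

/* One row of /proc/cgroups: subsys_name hierarchy num_cgroups enabled */
struct cg_entry {
	char name[64];
	int hierarchy;
	int num_cgroups;
	int enabled;
};

/* Parse one line; returns 1 on success, 0 for the '#' header or a malformed line. */
int parse_cgroup_line(const char *line, struct cg_entry *e)
{
	if (line[0] == '#')
		return 0;
	return sscanf(line, "%63s %d %d %d", e->name, &e->hierarchy,
		      &e->num_cgroups, &e->enabled) == 4;
}

/*
 * hierarchy != 0: the hierarchy was not really unmounted.
 * num_cgroups != 1: some cgroups were rmdir'ed but not really destroyed.
 */
int is_lingering(const struct cg_entry *e)
{
	return e->hierarchy != 0 || e->num_cgroups != 1;
}

/* Print each lingering subsystem found in 'path'; returns the count, or -1 if it cannot be opened. */
int report_lingering(const char *path)
{
	FILE *fp = fopen(path, "r");
	char buf[256];
	struct cg_entry e;
	int n = 0;

	if (!fp)
		return -1;
	while (fgets(buf, sizeof(buf), fp)) {
		if (parse_cgroup_line(buf, &e) && is_lingering(&e)) {
			printf("%s: hierarchy=%d num_cgroups=%d\n",
			       e.name, e.hierarchy, e.num_cgroups);
			n++;
		}
	}
	fclose(fp);
	return n;
}
```

Calling report_lingering("/proc/cgroups") after the reproducer has cleaned up should return 0 when nothing has leaked.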

Interesting. It looks like at some point I had some cpu and cpuacct
hierarchies that never really unmounted.

#subsys_name hierarchy num_cgroups enabled
cpuset 0 1 1
cpu 89 1 1
cpuacct 89 1 1
memory 0 1 1
devices 0 1 1
freezer 0 1 1
net_cls 0 1 1
blkio 0 1 1
perf_event 0 1 1
hugetlb 0 1 1

And playing a little more I get the leak scenario.

#subsys_name hierarchy num_cgroups enabled
cpuset 0 1 1
cpu 90 3 1
cpuacct 90 3 1
memory 90 3 1
devices 0 1 1
freezer 90 3 1
net_cls 0 1 1
blkio 0 1 1
perf_event 0 1 1
hugetlb 0 1 1

So it definitely did not unmount.

After echo 3 > /proc/sys/vm/drop_caches

#subsys_name hierarchy num_cgroups enabled
cpuset 0 1 1
cpu 90 1 1
cpuacct 90 1 1
memory 90 1 1
devices 0 1 1
freezer 90 1 1
net_cls 0 1 1
blkio 0 1 1
perf_event 0 1 1
hugetlb 0 1 1

Hmm. But after some time passes I have

#subsys_name hierarchy num_cgroups enabled
cpuset 0 1 1
cpu 0 1 1
cpuacct 0 1 1
memory 0 1 1
devices 0 1 1
freezer 0 1 1
net_cls 0 1 1
blkio 0 1 1
perf_event 0 1 1
hugetlb 0 1 1

Hmm. Looking further I see what is going on, and it has nothing to do
with the freezer. (I have commented out that code and reproduced it
without the freezer to be doubly certain.)

On the exit path, exit_robust_list is triggering a page fault to fault a
page back in. Since we have no memory, that fault causes the exit path
to get stuck in mem_cgroup_handle_oom.

Which means the following change should fix the hang. I will test it in just
a second.

The problem is that we only handled pending fatal signals and exiting
processes when the OOM logic was enabled. Sigh.
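
The ordering problem can be modelled in userspace. This is a simplified sketch, not kernel code: "dying" stands for fatal_signal_pending(current) || current->flags & PF_EXITING, and the enum and function names are hypothetical, introduced only to contrast the decision order before and after the change.

```c
#include <stdbool.h>

enum oom_action {
	OOM_LET_CURRENT_ALLOCATE,	/* set TIF_MEMDIE so current can exit */
	OOM_KILL_VICTIM,		/* run mem_cgroup_out_of_memory() */
	OOM_WAIT,			/* sleep on memcg_oom_waitq */
};

/*
 * Before: the dying-task check lived inside mem_cgroup_out_of_memory(),
 * so it was only reached when need_to_kill was true.
 */
enum oom_action oom_action_before(bool dying, bool need_to_kill)
{
	if (need_to_kill)
		return dying ? OOM_LET_CURRENT_ALLOCATE : OOM_KILL_VICTIM;
	return OOM_WAIT;
}

/* After: the dying-task check comes first, regardless of need_to_kill. */
enum oom_action oom_action_after(bool dying, bool need_to_kill)
{
	if (dying)
		return OOM_LET_CURRENT_ALLOCATE;
	if (need_to_kill)
		return OOM_KILL_VICTIM;
	return OOM_WAIT;
}
```

The two functions differ only for dying == true and need_to_kill == false, which is the case where the exiting task previously went to sleep.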

Eric

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 00a7a66..5998a57 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1792,16 +1792,6 @@ static void mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
 	unsigned int points = 0;
 	struct task_struct *chosen = NULL;
 
-	/*
-	 * If current has a pending SIGKILL or is exiting, then automatically
-	 * select it. The goal is to allow it to allocate so that it may
-	 * quickly exit and free its memory.
-	 */
-	if (fatal_signal_pending(current) || current->flags & PF_EXITING) {
-		set_thread_flag(TIF_MEMDIE);
-		return;
-	}
-
 	check_panic_on_oom(CONSTRAINT_MEMCG, gfp_mask, order, NULL);
 	totalpages = mem_cgroup_get_limit(memcg) >> PAGE_SHIFT ? : 1;
 	for_each_mem_cgroup_tree(iter, memcg) {
@@ -2220,7 +2210,15 @@ static bool mem_cgroup_handle_oom(struct mem_cgroup *memcg, gfp_t mask,
 	mem_cgroup_oom_notify(memcg);
 	spin_unlock(&memcg_oom_lock);
 
-	if (need_to_kill) {
+	/*
+	 * If current has a pending SIGKILL or is exiting, then automatically
+	 * select it. The goal is to allow it to allocate so that it may
+	 * quickly exit and free its memory.
+	 */
+	if (fatal_signal_pending(current) || current->flags & PF_EXITING) {
+		set_thread_flag(TIF_MEMDIE);
+		finish_wait(&memcg_oom_waitq, &owait.wait);
+	} else if (need_to_kill) {
 		finish_wait(&memcg_oom_waitq, &owait.wait);
 		mem_cgroup_out_of_memory(memcg, mask, order);
 	} else {