Re: [PATCH] Re: [RFC PATCH] namespaces: fix leak on fork() failure

From: Eric W. Biederman
Date: Fri May 04 2012 - 10:10:01 EST


Mike Galbraith <efault@xxxxxx> writes:

> On Fri, 2012-05-04 at 00:55 -0700, Eric W. Biederman wrote:
>
>> CLONE_NEWUSER? I presume you have applied my latest user namespace
>> patches? Otherwise you are running completely half baked code.
>
> I removed the CLONE_NEWUSER flag.
>
>> hackbench? Which kernel are you running? Hackbench in some kernels is
>> really good at triggering cache ping-pong effects with pids, and creds.
>
> Not when pinned. 3.0 kernel without the debug stuff enabled in 3.4.git.
>
> marge:/usr/local/tmp/starvation # taskset -c 3 ./hackbench
> Running with 10*40 (== 400) tasks.
> Time: 0.868
> marge:/usr/local/tmp/starvation # taskset -c 3 ./hackbench -namespace
> Running with 10*40 (== 400) tasks.
> Time: 7.582
> marge:/usr/local/tmp/starvation # taskset -c 3 ./hackbench -namespace -all
> Running with 10*40 (== 400) tasks.
> Time: 29.677

Interesting. I guess what truly puzzles me is what serializes all of
the processes. Even synchronize_rcu should sleep and thus let other
synchronize_rcu calls run in parallel.

Did you have HZ=100 in that kernel? 400 tasks at 100Hz all serialized
somehow and then doing synchronize_rcu at a jiffy each would account
for 4 seconds. And the nsproxy certainly has a synchronize_rcu call.
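
As a back-of-the-envelope check of that estimate, here is a small Python
sketch. It assumes HZ=100 (not confirmed for the kernel in question),
that each synchronize_rcu costs about one jiffy, and that all 400 tasks
are fully serialized. The function name is made up for illustration.

```python
def serialized_rcu_seconds(tasks, hz, calls_per_task=1):
    """Estimate wall time if every task waits its turn for each grace period.

    Assumes one grace period costs one jiffy (1/hz seconds). That is a
    simplification for the sake of the estimate, not a measured value.
    """
    jiffy = 1.0 / hz
    return tasks * calls_per_task * jiffy


# 400 tasks, HZ=100, one synchronize_rcu each: roughly 4 seconds.
print(serialized_rcu_seconds(tasks=400, hz=100))
```

With HZ=1000 the same model gives roughly 0.4 seconds, so the HZ value
matters a lot for whether this explanation fits the numbers.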

The network namespace is comparatively heavyweight, at least in the
amount of code and other things it has to go through, so that would be
my prime suspect for those 29 seconds. There are 2-4 synchronize_rcu
calls needed to put the loopback device. Still, we use
synchronize_rcu_expedited, and that work should be out of line and all of
those calls should batch.
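
To make the batching argument concrete, here is a toy model in Python,
under loudly stated assumptions: every task needs the same number of
grace periods, a grace period has a fixed cost, and perfectly batched
callers all share the same grace periods. The function name and the
10 ms grace-period figure are illustrative only, not measurements.

```python
def teardown_seconds(tasks, grace_periods_per_task, grace_period_s, batched):
    """Toy wall-clock estimate for tearing down `tasks` namespaces.

    batched=True:  all callers share each grace period, so the cost does
                   not grow with the number of tasks.
    batched=False: callers wait one after another, so the cost scales
                   linearly with the number of tasks.
    """
    per_task = grace_periods_per_task * grace_period_s
    return per_task if batched else tasks * per_task


for n in (2, 4):  # the 2-4 synchronize_rcu calls per loopback device
    serial = teardown_seconds(400, n, 0.010, batched=False)
    shared = teardown_seconds(400, n, 0.010, batched=True)
    print(f"{n} calls: serialized {serial:.2f}s, batched {shared:.2f}s")
```

In this model full serialization costs seconds while proper batching
costs milliseconds, which is why the measured 29 seconds suggests the
calls are not batching the way they should.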

Mike, is this something you are looking at pursuing further?

I want to guess the serialization comes from waiting on children to be
reaped, but the namespaces are all cleaned up in exit_notify() called
from do_exit(), so that theory doesn't hold water. The worst case
I can see is detach_pid from exit_signal running under the task list lock,
but nothing sleeps under that lock. :(

So I am very puzzled why the code serializes itself in a way that leads
to those long delays. Shrug.

Eric