Re: [PATCH] pid_ns: Allow to get pid_for_children ns before child_reaper is created

From: Kirill Tkhai
Date: Mon May 29 2017 - 06:49:54 EST


On 27.05.2017 14:01, Eric W. Biederman wrote:
> Kirill Tkhai <ktkhai@xxxxxxxxxxxxx> writes:
>
>> This patch prohibits pid allocation till child_reaper
>> of pid namespace is set, and it makes possible and safe
>> to get just unshared pid_ns from "/proc/[pid]/ns/pid_for_children"
>> file. This may be useful to determine user_ns of such a created
>> pid_ns, which is not possible now.
>>
>> It was prohibited till now, because the architecture of pid namespaces
>> assumes child reaper is the firstly created process of the namespace,
>> and it initializes pid_namespace::proc_mnt. Child reaper creation
>> mustn't race with creation of another processes from this namespace,
>> otherwise a process with pid > 1 may die before pid_namespace::proc_mnt
>> is populated and it will get a null pointer dereference in proc_flush_task().
>> Also, child reaper mustn't die before processes from the namespace.
>
> This patch introduces the possibility that two or more processes may
> have the same pid namespace (with no processes) as pid_ns_for_children.
>
> Which means you can now have a race for the first pid in alloc_pid.
> Making it indeterminant who allocates the init process. Which is not
> acceptable.
>
> It is not acceptable on two grounds.
> 1) It is a bogus user space semantic. Because userspace needs to
> know who allocates init.
> 2) It is horrible for maintenance becuase now the code has to be very
> clever to deal with a case that no one cares about. Which is
> a general formula for buggy code.

We may disallow setns() if there is no child reaper created, and
this solves all above issues. Please see v2 below, it has no problems
you pointed.

[PATCH v2]pid_ns: Allow to get pid_for_children ns before child_reaper is created

This patch prohibits setns() on a pid namespace till its child_reaper
is set, and it makes possible and safe to get just unshared pid_ns
from "/proc/[pid]/ns/pid_for_children" file. This may be useful
to determine user_ns of such a created pid_ns, which is not possible now.

It was not possible till now, because the architecture of pid namespaces
assumes child reaper is the first created process of the namespace,
and it initializes pid_namespace::proc_mnt. Child reaper creation
mustn't race with creation of another processes from this namespace,
otherwise a process with pid > 1 may die before pid_namespace::proc_mnt
is populated and it will get a null pointer dereference in proc_flush_task().
Also, child reaper mustn't die before processes from the namespace.

The patch prevents such races. It allows to setns() on a pid namespace
only if ns->child_reaper is already set, and this guarantees, that
only pid namespace creator may establish child reaper.
So, we can safely allow to get "/proc/[pid]/ns/pid_for_children"
since it's created, and to analyse it.

v2: Don't race for child reaper creation.

Signed-off-by: Kirill Tkhai <ktkhai@xxxxxxxxxxxxx>
CC: Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx>
CC: "Eric W. Biederman" <ebiederm@xxxxxxxxxxxx>
CC: Oleg Nesterov <oleg@xxxxxxxxxx>
CC: Andy Lutomirski <luto@xxxxxxxxxx>
CC: Serge Hallyn <serge@xxxxxxxxxx>
CC: Michal Hocko <mhocko@xxxxxxxx>
CC: Andrei Vagin <avagin@xxxxxxxxxx>
CC: Cyrill Gorcunov <gorcunov@xxxxxxxxxx>
CC: Mike Rapoport <rppt@xxxxxxxxxxxxxxxxxx>
CC: Ingo Molnar <mingo@xxxxxxxxxx>
CC: Peter Zijlstra <peterz@xxxxxxxxxxxxx>
---
kernel/pid_namespace.c | 18 +++++++++---------
1 file changed, 9 insertions(+), 9 deletions(-)

diff --git a/kernel/pid_namespace.c b/kernel/pid_namespace.c
index 74a5a7255b4d..5e7b3fd0d4c2 100644
--- a/kernel/pid_namespace.c
+++ b/kernel/pid_namespace.c
@@ -385,15 +385,6 @@ static struct ns_common *pidns_for_children_get(struct task_struct *task)
}
task_unlock(task);

- if (ns) {
- read_lock(&tasklist_lock);
- if (!ns->child_reaper) {
- put_pid_ns(ns);
- ns = NULL;
- }
- read_unlock(&tasklist_lock);
- }
-
return ns ? &ns->ns : NULL;
}

@@ -428,6 +419,15 @@ static int pidns_install(struct nsproxy *nsproxy, struct ns_common *ns)
if (ancestor != active)
return -EINVAL;

+ /*
+ * Disallow processes to use pid namespace till its
+ * creator makes child reaper. Otherwise, several
+ * processes race for that, and it's not clear who
+ * establishes init.
+ */
+ if (!new->child_reaper)
+ return -ESRCH;
+
put_pid_ns(nsproxy->pid_ns_for_children);
nsproxy->pid_ns_for_children = get_pid_ns(new);
return 0;