[RFC PATCH 01/14] add Documentation/namespaces/user_namespace.txt

From: Serge Hallyn
Date: Tue Jul 12 2011 - 19:38:32 EST


From: Serge E. Hallyn <serge.hallyn@xxxxxxxxxxxxx>

This will hold some info about the design. Currently it contains
future todos, issues and questions.

Signed-off-by: Serge E. Hallyn <serge.hallyn@xxxxxxxxxxxxx>
Cc: Eric W. Biederman <ebiederm@xxxxxxxxxxxx>
---
Documentation/namespaces/user_namespace.txt | 93 +++++++++++++++++++++++++++
1 files changed, 93 insertions(+), 0 deletions(-)
create mode 100644 Documentation/namespaces/user_namespace.txt

diff --git a/Documentation/namespaces/user_namespace.txt b/Documentation/namespaces/user_namespace.txt
new file mode 100644
index 0000000..24c894f
--- /dev/null
+++ b/Documentation/namespaces/user_namespace.txt
@@ -0,0 +1,93 @@
+Description
+===========
+
+Traditionally, each task is owned by a userid (uid) and belongs to one
+or more groups (gid). Both are simple numeric ids, though userspace
+usually translates them to names. The user namespace allows tasks to
+have different views of the uids and gids associated with tasks and
+other resources.
+
+The user namespace is a simple hierarchical one. The system begins
+with all tasks belonging to the initial user namespace. A task creates
+a new user namespace by passing the CLONE_NEWUSER flag to clone(2).
+To do so, the creating task needs the CAP_SETUID, CAP_SETGID, and
+CAP_CHOWN capabilities, but does not need to be root. The clone(2)
+call will result in a new task which to the creator appears to have
+the same credentials as itself, but which sees itself as being uid
+and gid 0. Any task in or resource belonging to the initial user
+namespace will, to this new task, appear to belong to uid and gid
+-1, which is usually known as 'nobody'. Opening such files will
+result in obtaining the 'user other' permissions. UID comparisons
+will return false, and privilege will be denied.
+
+When a task belonging to userid 500 in the initial user namespace
+creates a new user namespace, even though the new task will see itself
+as belonging to uid 0, any task in the initial user namespace
+will see it as belonging to uid 500. Therefore, uid 500 in the
+initial user namespace will be able to kill the new task. Files
+created by the new user will (eventually) be seen by tasks in its
+own user namespace as belonging to uid 0, but to tasks in the initial
+user namespace as belonging to uid 500. Note that this userid
+mapping for the VFS is not yet implemented, though the lkml and
+containers mailing list archives will show several previous prototypes.
+In the end, those got hung up waiting for the concept of targeted
+capabilities to be developed, which, thanks to the insight of Eric
+Biederman, it finally was.
+
+Other namespaces, such as UTS and network, are owned by a user
+namespace. When such a namespace is created, it is assigned to the user
+namespace by which it was created. Therefore, attempts to exercise
+privilege to resources in a network namespace can be properly validated
+by checking whether the caller has the needed privilege targeted to the
+user namespace owning the network namespace. This is called checking
+targeted capabilities, and is done using the 'ns_capable' function.
+
+As an example, if a new task is cloned with a private user namespace but
+no private network namespace, then the task's network namespace is owned
+by the parent user namespace. The new task has no privilege to the
+parent user namespace, so it will not be able to create or configure
+network devices. If, instead, the task were cloned with both private
+user and network namespaces, then the private network namespace is owned
+by the private user namespace, and so root in the new user namespace
+will have privilege targeted to the network namespace. It will be able
+to create and configure network devices.
+
+Working notes
+=============
+capable() checks for actions related to syslog must be against the
+init_user_ns until syslog is containerized.
+
+The same is true for reboot and power, control groups, devices, and time.
+
+Perf actions (kernel/event/core.c for instance) will always be
+constrained to init_user_ns.
+
+Q:
+Is accounting considered properly containerized wrt pidns? (It
+appears to be.) If so, then we can change the capable check in
+kernel/acct.c to 'ns_capable(current_pid_ns()->user_ns, CAP_PACCT)'.
+
+Q:
+For things like nice and schedaffinity, we could allow root in a
+container to control those, and leave only cgroups to constrain
+the container. I'm not sure whether that is right, or whether it
+violates admin expectations.
+
+I punted on some of commoncap.c. I'm punting on xattr stuff as
+they take dentries, not inodes.
+
+For drivers/tty/tty_io.c and drivers/tty/vt/vt.c, we'll want to (for
+some of them) target at the user_ns owning the tty. That will have
+to wait until we get userns owning files straightened out.
+
+We need to figure out how to label devices. Should we just toss a user_ns
+right into struct device?
+
+capable(CAP_MAC_ADMIN) checks are always to be against init_user_ns,
+unless LSMs are some day containerized, which is very unlikely.
+
+inode_owner_or_capable() should probably take optional ns and cap
+parameters. If cap is 0, then CAP_FOWNER is checked. If ns is
+NULL, we derive the ns from inode. But if ns is provided, then
+callers who need to derive inode_userns(inode) anyway can save a
+few cycles.
--
1.7.4.1
