Re: [PATCHv1 0/8] CGroup Namespaces

From: Aditya Kali
Date: Tue Oct 14 2014 - 19:33:38 EST


On Tue, Oct 14, 2014 at 3:42 PM, Andy Lutomirski <luto@xxxxxxxxxxxxxx> wrote:
> On Mon, Oct 13, 2014 at 2:23 PM, Aditya Kali <adityakali@xxxxxxxxxx> wrote:
>> Second take at the Cgroup Namespace patch-set.
>>
>> Major changes from RFC (V0):
>> 1. setns support for cgroupns
>> 2. 'mount -t cgroup cgroup <mntpt>' from inside a cgroupns now
>> mounts the cgroup hierarchy with cgroupns-root as the filesystem root.
>> 3. writes to cgroup files outside of cgroupns-root are not allowed
>> 4. visibility of /proc/<pid>/cgroup is further restricted by not showing
>> anything if the <pid> is in a sibling cgroupns and its cgroup falls outside
>> your cgroupns-root.
>>
>> More details in the writeup below.
>>
>> Background
>> Cgroups and Namespaces are used together to create "virtual"
>> containers that isolate the host environment from the processes
>> running in the container. But since cgroups themselves are not
>> "virtualized", the task is always able to see the global cgroups view
>> through the cgroupfs mount and via the /proc/self/cgroup file.
>>
>> $ cat /proc/self/cgroup
>> 0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/c_job_id1
>>
>> This exposure of cgroup names to the processes running inside a
>> container results in some problems:
>> (1) The container names are typically host-container-management-agent
>> (systemd, docker/libcontainer, etc.) data and leaking its name (or
>> leaking the hierarchy) reveals too much information about the host
>> system.
>> (2) It makes the container migration across machines (CRIU) more
>> difficult as the container names need to be unique across the
>> machines in the migration domain.
>> (3) It makes it difficult to run container management tools (like
>> docker/libcontainer, lmctfy, etc.) within virtual containers
>> without adding dependency on some state/agent present outside the
>> container.
>>
>> Note that the feature proposed here is completely different than the
>> "ns cgroup" feature which existed in the linux kernel until recently.
>> The ns cgroup also attempted to connect cgroups and namespaces by
>> creating a new cgroup every time a new namespace was created. It did
>> not solve any of the above mentioned problems and was later dropped
>> from the kernel. Incidentally though, it used the same config option
>> name CONFIG_CGROUP_NS as used in my prototype!
>>
>> Introducing CGroup Namespaces
>> With unified cgroup hierarchy
>> (Documentation/cgroups/unified-hierarchy.txt), the containers can now
>> have a much more coherent cgroup view and it's easy to associate a
>> container with a single cgroup. This also allows us to virtualize the
>> cgroup view for tasks inside the container.
>>
>> The new CGroup Namespace allows a process to "unshare" its cgroup
>> hierarchy starting from the cgroup it's currently in.
>> For Ex:
>> $ cat /proc/self/cgroup
>> 0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/c_job_id1
>> $ ls -l /proc/self/ns/cgroup
>> lrwxrwxrwx 1 root root 0 2014-07-15 10:37 /proc/self/ns/cgroup -> cgroup:[4026531835]
>> $ ~/unshare -c # calls unshare(CLONE_NEWCGROUP) and exec's /bin/bash
>> [ns]$ ls -l /proc/self/ns/cgroup
>> lrwxrwxrwx 1 root root 0 2014-07-15 10:35 /proc/self/ns/cgroup -> cgroup:[4026532183]
>> # From within new cgroupns, process sees that it's in the root cgroup
>> [ns]$ cat /proc/self/cgroup
>> 0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/
>>
>> # From global cgroupns:
>> $ cat /proc/<pid>/cgroup
>> 0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/c_job_id1
>>
>> # Unshare cgroupns along with userns and mountns
>> # Following calls unshare(CLONE_NEWCGROUP|CLONE_NEWUSER|CLONE_NEWNS), then
>> # sets up uid/gid map and exec's /bin/bash
>> $ ~/unshare -c -u -m
>>
>> # Originally, we were in /batchjobs/c_job_id1 cgroup. Mount our own cgroup
>> # hierarchy.
>> [ns]$ mount -t cgroup cgroup /tmp/cgroup
>> [ns]$ ls -l /tmp/cgroup
>> total 0
>> -r--r--r-- 1 root root 0 2014-10-13 09:32 cgroup.controllers
>> -r--r--r-- 1 root root 0 2014-10-13 09:32 cgroup.populated
>> -rw-r--r-- 1 root root 0 2014-10-13 09:25 cgroup.procs
>> -rw-r--r-- 1 root root 0 2014-10-13 09:32 cgroup.subtree_control
>>
>> The cgroupns-root (/batchjobs/c_job_id1 in above example) becomes the
>> filesystem root for the namespace specific cgroupfs mount.
>>
>> The virtualization of /proc/self/cgroup file combined with restricting
>> the view of cgroup hierarchy by namespace-private cgroupfs mount
>> should provide a completely isolated cgroup view inside the container.
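
A minimal sketch (Python, not part of the patchset) of how a tool might tell two cgroup namespaces apart using the /proc/<pid>/ns/cgroup links shown in the example above. The helper names parse_ns_link and cgroupns_id are hypothetical, and the ns/cgroup link only exists on a kernel that carries cgroup namespace support.

```python
import os
import re

# Matches link targets such as "cgroup:[4026531835]".
_NS_LINK = re.compile(r"^(\w+):\[(\d+)\]$")


def parse_ns_link(target):
    """Split a namespace link target into (type, inode number)."""
    match = _NS_LINK.match(target)
    if match is None:
        raise ValueError("not a namespace link target: %r" % (target,))
    return match.group(1), int(match.group(2))


def cgroupns_id(pid="self"):
    """Return the inode number behind /proc/<pid>/ns/cgroup.

    Assumes the running kernel exposes this link; os.readlink raises
    OSError when it does not.
    """
    return parse_ns_link(os.readlink("/proc/%s/ns/cgroup" % pid))[1]
```

With the values from the example, parse_ns_link("cgroup:[4026531835]") and parse_ns_link("cgroup:[4026532183]") yield different inode numbers, which is how the unshare shows up from user space.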
>>
>> In its current form, the cgroup namespaces patchset provides the following
>> behavior:
>>
>> (1) The "root" cgroup for a cgroup namespace is the cgroup in which
>> the process calling unshare is running.
>> For ex. if a process in /batchjobs/c_job_id1 cgroup calls unshare,
>> cgroup /batchjobs/c_job_id1 becomes the cgroupns-root.
>> For the init_cgroup_ns, this is the real root ("/") cgroup
>> (identified in code as cgrp_dfl_root.cgrp).
>>
>> (2) The cgroupns-root cgroup does not change even if the namespace
>> creator process later moves to a different cgroup.
>> $ ~/unshare -c # unshare cgroupns in some cgroup
>> [ns]$ cat /proc/self/cgroup
>> 0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/
>> [ns]$ mkdir sub_cgrp_1
>> [ns]$ echo 0 > sub_cgrp_1/cgroup.procs
>> [ns]$ cat /proc/self/cgroup
>> 0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/sub_cgrp_1
>>
>> (3) Each process gets its cgroupns-specific view of
>> /proc/<pid>/cgroup.
>> (a) Processes running inside the cgroup namespace will be able to see
>> cgroup paths (in /proc/self/cgroup) only inside their root cgroup:
>> [ns]$ sleep 100000 & # From within unshared cgroupns
>> [1] 7353
>> [ns]$ echo 7353 > sub_cgrp_1/cgroup.procs
>> [ns]$ cat /proc/7353/cgroup
>> 0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/sub_cgrp_1
>>
>> (b) From global cgroupns, the real cgroup path will be visible:
>> $ cat /proc/7353/cgroup
>> 0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/c_job_id1/sub_cgrp_1
>
> This is a little weird. Not sure it's a problem.
>
>>
>> (c) From a sibling cgroupns (cgroupns rooted at a sibling cgroup), no cgroup
>> path will be visible:
>> # ns2's cgroupns-root is at '/batchjobs/c_job_id2'
>> [ns2]$ cat /proc/7353/cgroup
>> [ns2]$
>> This is the same as when the cgroup hierarchy is not mounted at all.
>> (In a correct container setup though, it should not be possible to
>> access PIDs in another container in the first place.)
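
The three cases in (3) can be modelled in a few lines of user-space Python. This is only a sketch of the behavior described above, not the kernel code; the function name visible_cgroup_path and its arguments are made up for illustration.

```python
from typing import Optional


def visible_cgroup_path(real_path: str, ns_root: str) -> Optional[str]:
    """Model what a reader whose cgroupns-root is ns_root would see in
    /proc/<pid>/cgroup for a task whose real cgroup is real_path.

    Returns None when nothing would be shown (sibling cgroupns case).
    """
    real = real_path.rstrip("/") or "/"
    root = ns_root.rstrip("/") or "/"
    if root == "/":
        # Global cgroupns: the real path is visible as-is.
        return real
    if real == root:
        # The task sits exactly at the reader's cgroupns-root.
        return "/"
    if real.startswith(root + "/"):
        # Inside the reader's subtree: strip the cgroupns-root prefix.
        return real[len(root):]
    # Outside the reader's cgroupns-root: no path is visible.
    return None
```

For pid 7353 from the examples, a reader rooted at /batchjobs/c_job_id1 gets "/sub_cgrp_1", a reader in the global cgroupns gets the full path, and a reader rooted at /batchjobs/c_job_id2 gets None.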
>>
>> (4) Processes inside a cgroupns are not allowed to move out of the
>> cgroupns-root. This is true even if a privileged process in global
>> cgroupns tries to move the process out of its cgroupns-root.
>>
>> # From global cgroupns
>> $ cat /proc/7353/cgroup
>> 0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/c_job_id1/sub_cgrp_1
>> # cgroupns-root for 7353 is /batchjobs/c_job_id1
>> $ echo 7353 > batchjobs/c_job_id2/cgroup.procs
>> -bash: echo: write error: Operation not permitted
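
As a rough user-space model of rule (4), the check amounts to "the destination cgroup must lie within the task's cgroupns-root". The sketch below is in Python and the name may_migrate is hypothetical; the real check lives in the kernel and fails the write with EPERM as in the example above.

```python
def may_migrate(dest_path: str, task_ns_root: str) -> bool:
    """Return True if a task whose cgroupns-root is task_ns_root could be
    moved to dest_path under rule (4), i.e. dest_path is the cgroupns-root
    itself or one of its descendants."""
    dest = dest_path.rstrip("/") or "/"
    root = task_ns_root.rstrip("/") or "/"
    if root == "/":
        # Tasks in the global cgroupns are not confined by this rule.
        return True
    return dest == root or dest.startswith(root + "/")
```

With the values above, may_migrate("/batchjobs/c_job_id2", "/batchjobs/c_job_id1") is False, which corresponds to the "Operation not permitted" error, while a move to /batchjobs/c_job_id1/sub_cgrp_1 would pass.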
>>
>
>>
>> (6) When some thread from a multi-threaded process unshares its
>> cgroup-namespace, the new cgroupns gets applied to the entire
>> process (all the threads). This should be OK since
>> unified-hierarchy only allows process-level containerization. So
>> all the threads in the process will have the same cgroup. And both
>> - changing cgroups and unsharing namespaces - are protected under
>> threadgroup_lock(task).
>
> This seems odd to me. Does unsharing the cgroupns unshare for all
> tasks in the process? If not, then I think that it shouldn't change
> the cgroup either.
>

Unsharing cgroupns unshares for all tasks in the process, yes.

The cgroup changes are protected by threadgroup_lock. So it made sense
to protect cgroupns changes (unshare or setns) by the same lock, as we
don't want the task's cgroup to change underneath us while we are changing
its cgroup-namespace. No cgroup change happens during the
unshare/setns call.
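
To illustrate the locking argument, here is a toy Python model, not the kernel implementation. The ThreadGroup class and its methods are invented for this sketch, and a plain threading.Lock stands in for threadgroup_lock(task).

```python
import threading


class ThreadGroup:
    """Toy model of a process: one cgroup and one cgroupns-root shared
    by all of its threads."""

    def __init__(self, cgroup, cgroupns_root="/"):
        self._lock = threading.Lock()  # stand-in for threadgroup_lock(task)
        self.cgroup = cgroup
        self.cgroupns_root = cgroupns_root

    def migrate(self, new_cgroup):
        # A cgroup change takes the lock ...
        with self._lock:
            self.cgroup = new_cgroup

    def unshare_cgroupns(self):
        # ... and so does a cgroupns change, so the cgroup cannot move
        # while it is being sampled as the new cgroupns-root.
        with self._lock:
            self.cgroupns_root = self.cgroup  # self.cgroup is not modified
```

Because both paths serialize on the same lock, unshare_cgroupns() always records a cgroup the process is really in at that moment, and the process's cgroup is the same before and after the call.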

> What did you end up doing to grant permission to unshare the cgroup ns?
>

Currently the only requirement is ns_capable(cgroupns->user_ns,
CAP_SYS_ADMIN). It's possible to refine this further, but for now I
just kept it simple. I am looking into the explicit permission check
discussed previously (https://lkml.org/lkml/2014/7/29/402), but wanted
to get this out sooner.
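
For anyone wanting to check from user space whether a task is likely to pass this, a rough Python approximation is to look at the CapEff mask in /proc/<pid>/status. Note the assumption: this only inspects the task's effective set as reported in its own user namespace, whereas ns_capable() checks against a specific user_ns inside the kernel, so it is a hint and not an equivalent test. The helper names are hypothetical.

```python
CAP_SYS_ADMIN = 21  # bit number from <linux/capability.h>


def read_cap_eff(status_text):
    """Extract the hex CapEff mask from the text of /proc/<pid>/status."""
    for line in status_text.splitlines():
        if line.startswith("CapEff:"):
            return line.split()[1]
    raise ValueError("no CapEff line found")


def has_cap(cap_eff_hex, cap=CAP_SYS_ADMIN):
    """Return True if capability bit `cap` is set in the hex mask."""
    return bool((int(cap_eff_hex, 16) >> cap) & 1)
```

Typical use would be has_cap(read_cap_eff(open("/proc/self/status").read())).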

> --Andy

Thanks,
--
Aditya