[PATCHv1 0/8] CGroup Namespaces

Tue Oct 14 23:33:11 UTC 2014

On Tue, Oct 14, 2014 at 3:42 PM, Andy Lutomirski <luto at amacapital.net> wrote:
> On Mon, Oct 13, 2014 at 2:23 PM, Aditya Kali <adityakali at google.com> wrote:
>> Second take at the Cgroup Namespace patch-set.
>>
>> Major changes from RFC (V0):
>> 1. setns support for cgroupns
>> 2. 'mount -t cgroup cgroup <mntpt>' from inside a cgroupns now
>>    mounts the cgroup hierarchy with cgroupns-root as the filesystem root.
>> 3. writes to cgroup files outside of cgroupns-root are not allowed
>> 4. visibility of /proc/<pid>/cgroup is further restricted by not showing
>>    anything if the <pid> is in a sibling cgroupns and its cgroup falls outside
>>    your cgroupns-root.
>>
>> More details in the writeup below.
>>
>> Background
>>   Cgroups and Namespaces are used together to create “virtual”
>>   containers that isolate the host environment from the processes
>>   running in the container. But since cgroups themselves are not
>>   “virtualized”, a task is always able to see the global cgroup view
>>   through the cgroupfs mount and via the /proc/self/cgroup file.
>>
>>   $ cat /proc/self/cgroup
>>   0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/c_job_id1
>>
>>   This exposure of cgroup names to the processes running inside a
>>   container results in some problems:
>>   (1) The container names are typically host-container-management-agent
>>       (systemd, docker/libcontainer, etc.) data, and leaking these names
>>       (or the hierarchy) reveals too much information about the host
>>       system.
>>   (2) It makes container migration across machines (CRIU) more
>>       difficult, as the container names need to be unique across the
>>       machines in the migration domain.
>>   (3) It makes it difficult to run container management tools (like
>>       docker/libcontainer, lmctfy, etc.) within virtual containers
>>       without adding dependency on some state/agent present outside the
>>       container.
>>
>>   Note that the feature proposed here is completely different from the
>>   “ns cgroup” feature which existed in the Linux kernel until recently.
>>   The ns cgroup also attempted to connect cgroups and namespaces by
>>   creating a new cgroup every time a new namespace was created. It did
>>   not solve any of the above mentioned problems and was later dropped
>>   from the kernel. Incidentally though, it used the same config option
>>   name CONFIG_CGROUP_NS as used in my prototype!
>>
>> Introducing CGroup Namespaces
>>   With the unified cgroup hierarchy
>>   (Documentation/cgroups/unified-hierarchy.txt), containers can now
>>   have a much more coherent cgroup view and it's easy to associate a
>>   container with a single cgroup. This also allows us to virtualize the
>>   cgroup view for tasks inside the container.
>>
>>   The new CGroup Namespace allows a process to “unshare” its cgroup
>>   hierarchy starting from the cgroup it is currently in.
>>   For example:
>>   $ cat /proc/self/cgroup
>>   0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/c_job_id1
>>   $ ls -l /proc/self/ns/cgroup
>>   lrwxrwxrwx 1 root root 0 2014-07-15 10:37 /proc/self/ns/cgroup -> cgroup:[4026531835]
>>   $ ~/unshare -c  # calls unshare(CLONE_NEWCGROUP) and exec’s /bin/bash
>>   [ns]$ ls -l /proc/self/ns/cgroup
>>   lrwxrwxrwx 1 root root 0 2014-07-15 10:35 /proc/self/ns/cgroup -> cgroup:[4026532183]
>>   # From within new cgroupns, process sees that it's in the root cgroup
>>   [ns]$ cat /proc/self/cgroup
>>   0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/
>>
>>   # From global cgroupns:
>>   $ cat /proc/<pid>/cgroup
>>   0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/c_job_id1
>>
>>   # Unshare cgroupns along with userns and mountns.
>>   # The following calls unshare(CLONE_NEWCGROUP|CLONE_NEWUSER|CLONE_NEWNS),
>>   # then sets up the uid/gid map and exec’s /bin/bash
>>   $ ~/unshare -c -u -m
>>
>>   # Originally, we were in /batchjobs/c_job_id1 cgroup. Mount our own cgroup
>>   # hierarchy.
>>   [ns]$ mount -t cgroup cgroup /tmp/cgroup
>>   [ns]$ ls -l /tmp/cgroup
>>   total 0
>>   -r--r--r-- 1 root root 0 2014-10-13 09:32 cgroup.controllers
>>   -r--r--r-- 1 root root 0 2014-10-13 09:32 cgroup.populated
>>   -rw-r--r-- 1 root root 0 2014-10-13 09:25 cgroup.procs
>>   -rw-r--r-- 1 root root 0 2014-10-13 09:32 cgroup.subtree_control
>>
>>   The cgroupns-root (/batchjobs/c_job_id1 in above example) becomes the
>>   filesystem root for the namespace specific cgroupfs mount.
>>
>>   The virtualization of /proc/self/cgroup file combined with restricting
>>   the view of cgroup hierarchy by namespace-private cgroupfs mount
>>   should provide a completely isolated cgroup view inside the container.
>>
>>   In its current form, the cgroup namespaces patchset provides the
>>   following behavior:
>>
>>   (1) The “root” cgroup for a cgroup namespace is the cgroup in which
>>       the process calling unshare is running.
>>       For example, if a process in the /batchjobs/c_job_id1 cgroup calls
>>       unshare, cgroup /batchjobs/c_job_id1 becomes the cgroupns-root.
>>       For the init_cgroup_ns, this is the real root (“/”) cgroup
>>       (identified in code as cgrp_dfl_root.cgrp).
>>
>>   (2) The cgroupns-root cgroup does not change even if the namespace
>>       creator process later moves to a different cgroup.
>>       $ ~/unshare -c # unshare cgroupns in some cgroup
>>       [ns]$ cat /proc/self/cgroup
>>       0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/
>>       [ns]$ mkdir sub_cgrp_1
>>       [ns]$ echo 0 > sub_cgrp_1/cgroup.procs
>>       [ns]$ cat /proc/self/cgroup
>>       0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/sub_cgrp_1
>>
>>   (3) Each process gets its cgroupns-specific view of
>>       /proc/<pid>/cgroup.
>>   (a) Processes running inside the cgroup namespace will be able to see
>>       cgroup paths (in /proc/self/cgroup) only inside their root cgroup
>>       [ns]$ sleep 100000 &  # From within unshared cgroupns
>>       [1] 7353
>>       [ns]$ echo 7353 > sub_cgrp_1/cgroup.procs
>>       [ns]$ cat /proc/7353/cgroup
>>       0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/sub_cgrp_1
>>
>>   (b) From global cgroupns, the real cgroup path will be visible:
>>       $ cat /proc/7353/cgroup
>>       0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/c_job_id1/sub_cgrp_1
>
> This is a little weird.  Not sure it's a problem.
>
>>
>>   (c) From a sibling cgroupns (cgroupns rooted at a sibling cgroup), no cgroup
>>       path will be visible:
>>       # ns2's cgroupns-root is at '/batchjobs/c_job_id2'
>>       [ns2]$ cat /proc/7353/cgroup
>>       [ns2]$
>>       This is the same as when the cgroup hierarchy is not mounted at all.
>>       (In a correct container setup though, it should not be possible to
>>        access PIDs in another container in the first place.)
>>
>>   (4) Processes inside a cgroupns are not allowed to move out of the
>>       cgroupns-root. This is true even if a privileged process in global
>>       cgroupns tries to move the process out of its cgroupns-root.
>>
>>       # From global cgroupns
>>       $ cat /proc/7353/cgroup
>>       0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/c_job_id1/sub_cgrp_1
>>       # cgroupns-root for 7353 is /batchjobs/c_job_id1
>>       $ echo 7353 > batchjobs/c_job_id2/cgroup.procs
>>       -bash: echo: write error: Operation not permitted
>>
>
>>
>>   (6) When some thread from a multi-threaded process unshares its
>>       cgroup-namespace, the new cgroupns gets applied to the entire
>>       process (all the threads). This should be OK since
>>       unified-hierarchy only allows process-level containerization. So
>>       all the threads in the process will have the same cgroup. And both
>>       - changing cgroups and unsharing namespaces - are protected under
>>       threadgroup_lock(task).
>
> This seems odd to me.  Does unsharing the cgroupns unshare for all
> tasks in the process?  If not, then I think that it shouldn't change
> the cgroup either.
>

Unsharing cgroupns unshares for all tasks in the process, yes.

The cgroup changes are protected by threadgroup_lock. So it made sense
to protect cgroupns changes (unshare or setns) by the same lock, as we
don't want the task's cgroup to change underneath us while we are
changing its cgroup-namespace. No cgroup change happens during the
unshare/setns call.
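
In case it helps, here is a rough userspace model of that locking
argument. It is only a sketch and not the patch code: struct proc_model,
model_migrate() and model_unshare_cgroupns() are made-up names, and a
pthread mutex stands in for threadgroup_lock(task). The point is just
that both operations take the same per-process lock, and that unshare
pins the cgroupns-root without touching the cgroup.

```c
#include <assert.h>
#include <pthread.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical userspace model of one thread group; not kernel code. */
struct proc_model {
	pthread_mutex_t group_lock;  /* stands in for threadgroup_lock(task) */
	char cgroup[256];            /* absolute cgroup path of the process */
	char ns_root[256];           /* cgroupns-root the process is bound to */
};

static void model_init(struct proc_model *p, const char *cgroup)
{
	assert(p != NULL && cgroup != NULL);
	pthread_mutex_init(&p->group_lock, NULL);
	snprintf(p->cgroup, sizeof(p->cgroup), "%s", cgroup);
	snprintf(p->ns_root, sizeof(p->ns_root), "/");  /* initial cgroupns */
}

/* Cgroup migration: changes the cgroup while holding the group lock. */
static void model_migrate(struct proc_model *p, const char *dst)
{
	pthread_mutex_lock(&p->group_lock);
	snprintf(p->cgroup, sizeof(p->cgroup), "%s", dst);
	pthread_mutex_unlock(&p->group_lock);
}

/*
 * Unshare: pins ns_root to the cgroup the process is in right now.
 * Taking the same lock means the cgroup cannot change while it is read,
 * and the cgroup itself is left untouched.
 */
static void model_unshare_cgroupns(struct proc_model *p)
{
	pthread_mutex_lock(&p->group_lock);
	snprintf(p->ns_root, sizeof(p->ns_root), "%s", p->cgroup);
	pthread_mutex_unlock(&p->group_lock);
}
```

In this model, a later model_migrate() into a sub-cgroup leaves ns_root
where it was, which matches behavior (2) in the writeup.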

> What did you end up doing to grant permission to unshare the cgroup ns?
>

Currently the only requirement is ns_capable(cgroupns->user_ns,
CAP_SYS_ADMIN). It's possible to refine this further, but for now I
just kept it simple. I am looking into the explicit permission check
discussed previously (https://lkml.org/lkml/2014/7/29/402), but wanted
to get this out sooner.
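
For reference, the visibility rules in (3) and the migration restriction
in (4) from the writeup both boil down to a "is this cgroup under the
cgroupns-root" check. Below is a minimal userspace sketch of that logic
on plain path strings. The function names (cg_is_descendant,
cg_view_path, cg_may_migrate) are hypothetical and chosen only for
illustration; the kernel side deals with cgroup objects rather than
strings, so treat this as a model of the semantics, not of the
implementation.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* True if 'path' is 'root' itself or lies below it (absolute paths). */
static bool cg_is_descendant(const char *root, const char *path)
{
	size_t n = strlen(root);

	assert(root[0] == '/' && path[0] == '/');
	if (n == 1)                     /* root is "/": everything is below it */
		return true;
	if (strncmp(root, path, n) != 0)
		return false;
	/* Reject "/a/b1" matching root "/a/b": need an exact or '/' boundary. */
	return path[n] == '\0' || path[n] == '/';
}

/*
 * Fill 'out' with the path that a reader whose cgroupns-root is 'ns_root'
 * would see for a task in 'cgroup'. An empty string models case (c),
 * where nothing is shown.
 */
static void cg_view_path(const char *ns_root, const char *cgroup,
			 char *out, size_t len)
{
	size_t n = strlen(ns_root);

	assert(len > 0);
	if (!cg_is_descendant(ns_root, cgroup))
		out[0] = '\0';                          /* case (c) */
	else if (n == 1)
		snprintf(out, len, "%s", cgroup);       /* case (b): real path */
	else if (cgroup[n] == '\0')
		snprintf(out, len, "/");                /* task is in the ns root */
	else
		snprintf(out, len, "%s", cgroup + n);   /* case (a): relative */
}

/* Rule (4): a task may only be moved to a cgroup under its cgroupns-root. */
static int cg_may_migrate(const char *task_ns_root, const char *dst)
{
	return cg_is_descendant(task_ns_root, dst) ? 0 : -EPERM;
}
```

With the paths from the examples, a reader rooted at /batchjobs/c_job_id1
gets "/sub_cgrp_1" for pid 7353, a reader rooted at /batchjobs/c_job_id2
gets an empty string, and moving 7353 to /batchjobs/c_job_id2 is refused
with -EPERM.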

> --Andy

Thanks,
-- 
Aditya