[PATCHv1 0/8] CGroup Namespaces

Wed Jul 22 18:10:34 UTC 2015

Has there been further movement on CLONE_NEWCGROUP outside of this?

vb

On Sun, Oct 19, 2014 at 12:54 AM, Eric W. Biederman
<ebiederm at xmission.com> wrote:
> Aditya Kali <adityakali at google.com> writes:
>
>> Second take at the Cgroup Namespace patch-set.
>>
>> Major changes from RFC (V0):
>> 1. setns support for cgroupns
>> 2. 'mount -t cgroup cgroup <mntpt>' from inside a cgroupns now
>>    mounts the cgroup hierarchy with cgroupns-root as the filesystem root.
>> 3. writes to cgroup files outside of cgroupns-root are not allowed
>> 4. visibility of /proc/<pid>/cgroup is further restricted by not showing
>>    anything if the <pid> is in a sibling cgroupns and its cgroup falls outside
>>    your cgroupns-root.
>>
>> More details in the writeup below.
>
> This definitely looks like the right direction to go, and something that
> in some form or another I had been asking for since cgroups were merged.
> So I am very glad to see this work moving forward.
>
> I had hoped that we might just be able to be clever with remounting
> cgroupfs but 2 things stand in the way.
> 1) /proc/<pid>/cgroup (but proc could capture that).
> 2) providing a hard guarantee that tasks stay within a subset of the
>    cgroup hierarchy.
>
> So I think this clearly meets the requirements for a new namespace.
>
> We need to have the discussion on chmod of files on cgroupfs.  There is
> a notion that has floated around that only systemd or only root (with
> the appropriate capabilities) should be allowed to set resource limits
> in cgroupfs.  In practical reality that is nonsense.  If an attribute
> is properly bound in its hierarchy it should be safe to change.
>
> Not all attributes are properly bound to hierarchy and some are or at
> least were dangerous for anyone except root to set.  So I suggest that a
> CFTYPE flag perhaps CFTYPE_UNPRIV be added for attributes that are safe
> to allow anyone to set, and require CFTYPE_UNPRIV be set before we chmod
> a cgroup attribute from root.
>
> That would be complementary work, and not strictly tied to the cgroup
> namespaces but unprivileged cgroup namespaces don't make much sense
> without that work.
>
> Eric
>
>> Background
>>   Cgroups and Namespaces are used together to create “virtual”
>>   containers that isolate the host environment from the processes
>>   running in the container. But since cgroups themselves are not
>>   “virtualized”, a task is always able to see the global cgroup view
>>   through the cgroupfs mount and via the /proc/self/cgroup file.
>>
>>   $ cat /proc/self/cgroup
>>   0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/c_job_id1
>>
>>   This exposure of cgroup names to the processes running inside a
>>   container results in some problems:
>>   (1) The container names are typically host-container-management-agent
>>       (systemd, docker/libcontainer, etc.) data and leaking its name (or
>>       leaking the hierarchy) reveals too much information about the host
>>       system.
>>   (2) It makes the container migration across machines (CRIU) more
>>       difficult as the container names need to be unique across the
>>       machines in the migration domain.
>>   (3) It makes it difficult to run container management tools (like
>>       docker/libcontainer, lmctfy, etc.) within virtual containers
>>       without adding dependency on some state/agent present outside the
>>       container.
>>
>>   Note that the feature proposed here is completely different than the
>>   “ns cgroup” feature which existed in the linux kernel until recently.
>>   The ns cgroup also attempted to connect cgroups and namespaces by
>>   creating a new cgroup every time a new namespace was created. It did
>>   not solve any of the above mentioned problems and was later dropped
>>   from the kernel. Incidentally though, it used the same config option
>>   name CONFIG_CGROUP_NS as used in my prototype!
>>
>> Introducing CGroup Namespaces
>>   With the unified cgroup hierarchy
>>   (Documentation/cgroups/unified-hierarchy.txt), the containers can now
>>   have a much more coherent cgroup view and it's easy to associate a
>>   container with a single cgroup. This also allows us to virtualize the
>>   cgroup view for tasks inside the container.
>>
>>   The new CGroup Namespace allows a process to “unshare” its cgroup
>>   hierarchy starting from the cgroup it's currently in.
>>   For example:
>>   $ cat /proc/self/cgroup
>>   0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/c_job_id1
>>   $ ls -l /proc/self/ns/cgroup
>>   lrwxrwxrwx 1 root root 0 2014-07-15 10:37 /proc/self/ns/cgroup -> cgroup:[4026531835]
>>   $ ~/unshare -c  # calls unshare(CLONE_NEWCGROUP) and exec’s /bin/bash
>>   [ns]$ ls -l /proc/self/ns/cgroup
>>   lrwxrwxrwx 1 root root 0 2014-07-15 10:35 /proc/self/ns/cgroup -> cgroup:[4026532183]
>>   # From within the new cgroupns, the process sees that it's in the root cgroup
>>   [ns]$ cat /proc/self/cgroup
>>   0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/
>>
>>   # From global cgroupns:
>>   $ cat /proc/<pid>/cgroup
>>   0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/c_job_id1
>>
>>   # Unshare cgroupns along with userns and mountns
>>   # Following calls unshare(CLONE_NEWCGROUP|CLONE_NEWUSER|CLONE_NEWNS), then
>>   # sets up uid/gid map and exec’s /bin/bash
>>   $ ~/unshare -c -u -m
>>
>>   # Originally, we were in /batchjobs/c_job_id1 cgroup. Mount our own cgroup
>>   # hierarchy.
>>   [ns]$ mount -t cgroup cgroup /tmp/cgroup
>>   [ns]$ ls -l /tmp/cgroup
>>   total 0
>>   -r--r--r-- 1 root root 0 2014-10-13 09:32 cgroup.controllers
>>   -r--r--r-- 1 root root 0 2014-10-13 09:32 cgroup.populated
>>   -rw-r--r-- 1 root root 0 2014-10-13 09:25 cgroup.procs
>>   -rw-r--r-- 1 root root 0 2014-10-13 09:32 cgroup.subtree_control
>>
>>   The cgroupns-root (/batchjobs/c_job_id1 in above example) becomes the
>>   filesystem root for the namespace specific cgroupfs mount.
>>
>>   The virtualization of /proc/self/cgroup file combined with restricting
>>   the view of cgroup hierarchy by namespace-private cgroupfs mount
>>   should provide a completely isolated cgroup view inside the container.
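
For anyone who wants to poke at this without the patched ~/unshare helper,
here is a minimal sketch of the same "unshare -c" step from Python via
ctypes. The flag value below is the one mainline <linux/sched.h> uses for
CLONE_NEWCGROUP; whether this v1 series uses the same bit is an assumption
on my part, and the helper names are made up for illustration. It needs a
kernel with cgroup namespace support and CAP_SYS_ADMIN, otherwise the call
simply fails.

```python
import ctypes
import os

# Assumption: mainline value of CLONE_NEWCGROUP; the v1 patch set may differ.
CLONE_NEWCGROUP = 0x02000000


def read_self_cgroup():
    """Return /proc/self/cgroup as seen from the current cgroup namespace."""
    with open("/proc/self/cgroup") as f:
        return f.read()


def unshare_cgroupns():
    """Call unshare(CLONE_NEWCGROUP) through libc; raise OSError on failure."""
    libc = ctypes.CDLL(None, use_errno=True)
    if libc.unshare(CLONE_NEWCGROUP) != 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))


if __name__ == "__main__":
    try:
        print("before:", read_self_cgroup(), end="")
        unshare_cgroupns()
        # With cgroupns support, the path shown here should now be "/".
        print("after: ", read_self_cgroup(), end="")
    except OSError as e:
        # EPERM without CAP_SYS_ADMIN, EINVAL on kernels without cgroupns.
        print("cgroupns demo failed:", e)
```

Unlike the helper in the writeup, this does not exec a shell; it only shows
the before/after view of /proc/self/cgroup from one process.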
>>
>>   In its current form, the cgroup namespaces patchset provides the
>>   following behavior:
>>
>>   (1) The “root” cgroup for a cgroup namespace is the cgroup in which
>>       the process calling unshare is running.
>>       For example, if a process in /batchjobs/c_job_id1 cgroup calls unshare,
>>       cgroup /batchjobs/c_job_id1 becomes the cgroupns-root.
>>       For the init_cgroup_ns, this is the real root (“/”) cgroup
>>       (identified in code as cgrp_dfl_root.cgrp).
>>
>>   (2) The cgroupns-root cgroup does not change even if the namespace
>>       creator process later moves to a different cgroup.
>>       $ ~/unshare -c # unshare cgroupns in some cgroup
>>       [ns]$ cat /proc/self/cgroup
>>       0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/
>>       [ns]$ mkdir sub_cgrp_1
>>       [ns]$ echo 0 > sub_cgrp_1/cgroup.procs
>>       [ns]$ cat /proc/self/cgroup
>>       0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/sub_cgrp_1
>>
>>   (3) Each process gets its CGROUPNS specific view of
>>       /proc/<pid>/cgroup.
>>   (a) Processes running inside the cgroup namespace will be able to see
>>       cgroup paths (in /proc/self/cgroup) only inside their root cgroup
>>       [ns]$ sleep 100000 &  # From within unshared cgroupns
>>       [1] 7353
>>       [ns]$ echo 7353 > sub_cgrp_1/cgroup.procs
>>       [ns]$ cat /proc/7353/cgroup
>>       0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/sub_cgrp_1
>>
>>   (b) From global cgroupns, the real cgroup path will be visible:
>>       $ cat /proc/7353/cgroup
>>       0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/c_job_id1/sub_cgrp_1
>>
>>   (c) From a sibling cgroupns (cgroupns rooted at a sibling cgroup), no cgroup
>>       path will be visible:
>>       # ns2's cgroupns-root is at '/batchjobs/c_job_id2'
>>       [ns2]$ cat /proc/7353/cgroup
>>       [ns2]$
>>       This is the same as when the cgroup hierarchy is not mounted at all.
>>       (In a correct container setup, though, it should not be possible to
>>        access PIDs in another container in the first place.)
>>
>>   (4) Processes inside a cgroupns are not allowed to move out of the
>>       cgroupns-root. This is true even if a privileged process in global
>>       cgroupns tries to move the process out of its cgroupns-root.
>>
>>       # From global cgroupns
>>       $ cat /proc/7353/cgroup
>>       0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/c_job_id1/sub_cgrp_1
>>       # cgroupns-root for 7353 is /batchjobs/c_job_id1
>>       $ echo 7353 > batchjobs/c_job_id2/cgroup.procs
>>       -bash: echo: write error: Operation not permitted
>>
>>   (5) Setns to another cgroup namespace is allowed only when:
>>       (a) process has CAP_SYS_ADMIN in its current userns
>>       (b) process has CAP_SYS_ADMIN in the target cgroupns' userns
>>       (c) the process's current cgroup is a descendant of the cgroupns-root
>>           of the target namespace.
>>       (d) the target cgroupns-root is a descendant of the current cgroupns-root.
>>       The last check (d) prevents processes from escaping their cgroupns-root by
>>       attaching to parent cgroupns. Thus, setns is allowed only when the process
>>       is trying to restrict itself to a deeper cgroup hierarchy.
>>
>>   (6) When some thread from a multi-threaded process unshares its
>>       cgroup-namespace, the new cgroupns gets applied to the entire
>>       process (all the threads). This should be OK since
>>       unified-hierarchy only allows process-level containerization. So
>>       all the threads in the process will have the same cgroup. And both
>>       - changing cgroups and unsharing namespaces - are protected under
>>       threadgroup_lock(task).
>>
>>   (7) The cgroup namespace is alive as long as there is at least 1
>>       process inside it. When the last process exits, the cgroup
>>       namespace is destroyed. The cgroupns-root and the actual cgroups
>>       remain though.
>>
>>   (8) 'mount -t cgroup cgroup <mntpt>' when called from within cgroupns mounts
>>       the unified cgroup hierarchy with cgroupns-root as the filesystem root.
>>       The process needs CAP_SYS_ADMIN in its userns and mntns. This allows the
>>       container management tools to be run inside the containers transparently.
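
To check my own reading of rules (3), (4) and (5) above, I wrote them down
as a small user-space model. This is only a sketch of the behavior as the
writeup describes it, not the kernel code: all function names are made up,
cgroups are plain path strings, and I am assuming "descendant" is inclusive
(a cgroup counts as a descendant of itself).

```python
from posixpath import normpath


def is_descendant(path, ancestor):
    """True if cgroup `path` equals `ancestor` or lies beneath it."""
    path, ancestor = normpath(path), normpath(ancestor)
    if ancestor == "/":
        return path.startswith("/")
    return path == ancestor or path.startswith(ancestor + "/")


def proc_cgroup_view(task_cgroup, reader_root):
    """Rule (3): path a reader sees in /proc/<pid>/cgroup, or None.

    `reader_root` is the cgroupns-root of the reading process. A task whose
    cgroup is outside that root yields None (nothing is shown).
    """
    task_cgroup, reader_root = normpath(task_cgroup), normpath(reader_root)
    if not is_descendant(task_cgroup, reader_root):
        return None
    if reader_root == "/":
        return task_cgroup
    # Strip the cgroupns-root prefix; the root itself is shown as "/".
    return task_cgroup[len(reader_root):] or "/"


def may_migrate(dest_cgroup, task_root):
    """Rule (4): a task may only be moved to a cgroup under its cgroupns-root."""
    return is_descendant(dest_cgroup, task_root)


def may_setns(cap_in_current_userns, cap_in_target_userns,
              current_cgroup, current_root, target_root):
    """Rule (5): setns is allowed only if conditions (a)-(d) all hold."""
    return bool(cap_in_current_userns                            # (a)
                and cap_in_target_userns                         # (b)
                and is_descendant(current_cgroup, target_root)   # (c)
                and is_descendant(target_root, current_root))    # (d)


if __name__ == "__main__":
    job1 = "/batchjobs/c_job_id1"
    task = job1 + "/sub_cgrp_1"
    print(proc_cgroup_view(task, job1))                    # inside the ns
    print(proc_cgroup_view(task, "/"))                     # global cgroupns
    print(proc_cgroup_view(task, "/batchjobs/c_job_id2"))  # sibling cgroupns
    print(may_migrate("/batchjobs/c_job_id2", job1))       # escape attempt
    print(may_setns(True, True, job1, job1, "/"))          # attach to parent
```

The prefix test appends "/" before comparing so that c_job_id1 is not
mistaken for an ancestor of c_job_id10. The last call shows why check (d)
matters: attaching to a cgroupns rooted above the current cgroupns-root is
refused even with both capabilities.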
>>
>> Implementation
>>   The current patch-set is based on top of Tejun Heo's cgroup tree (for-next
>>   branch). It's fairly non-intrusive and provides the above-mentioned
>>   features.
>>
>> Possible extensions of CGROUPNS:
>>   (1) The Documentation/cgroups/unified-hierarchy.txt mentions use of
>>       capabilities to restrict cgroups to administrative users. CGroup
>>       namespaces could be of help here. With cgroup namespaces, it might
>>       be possible to delegate administration of sub-cgroups under a
>>       cgroupns-root to the cgroupns owner.
>
>
>
>
>> ---
>>  fs/kernfs/dir.c                  |  53 +++++++++---
>>  fs/kernfs/mount.c                |  48 +++++++++++
>>  fs/proc/namespaces.c             |   3 +
>>  include/linux/cgroup.h           |  41 +++++++++-
>>  include/linux/cgroup_namespace.h |  62 +++++++++++++++
>>  include/linux/kernfs.h           |   5 ++
>>  include/linux/nsproxy.h          |   2 +
>>  include/linux/proc_ns.h          |   4 +
>>  include/uapi/linux/sched.h       |   3 +-
>>  init/Kconfig                     |   9 +++
>>  kernel/Makefile                  |   1 +
>>  kernel/cgroup.c                  | 139 ++++++++++++++++++++++++++------
>>  kernel/cgroup_namespace.c        | 168 +++++++++++++++++++++++++++++++++++++++
>>  kernel/fork.c                    |   2 +-
>>  kernel/nsproxy.c                 |  19 ++++-
>>  15 files changed, 518 insertions(+), 41 deletions(-)
>>  create mode 100644 include/linux/cgroup_namespace.h
>>  create mode 100644 kernel/cgroup_namespace.c
>>
>> [PATCHv1 1/8] kernfs: Add API to generate relative kernfs path
>> [PATCHv1 2/8] sched: new clone flag CLONE_NEWCGROUP for cgroup
>> [PATCHv1 3/8] cgroup: add function to get task's cgroup on default
>> [PATCHv1 4/8] cgroup: export cgroup_get() and cgroup_put()
>> [PATCHv1 5/8] cgroup: introduce cgroup namespaces
>> [PATCHv1 6/8] cgroup: restrict cgroup operations within task's cgroupns
>> [PATCHv1 7/8] cgroup: cgroup namespace setns support
>> [PATCHv1 8/8] cgroup: mount cgroupns-root when inside non-init cgroupns
>> _______________________________________________
>> Containers mailing list
>> Containers at lists.linux-foundation.org
>> https://lists.linuxfoundation.org/mailman/listinfo/containers