[PATCHv1 7/8] cgroup: cgroup namespace setns support

Aditya Kali adityakali at google.com
Wed Oct 22 18:37:55 UTC 2014

On Tue, Oct 21, 2014 at 5:58 PM, Andy Lutomirski <luto at amacapital.net> wrote:
> On Tue, Oct 21, 2014 at 5:46 PM, Aditya Kali <adityakali at google.com> wrote:
>> On Tue, Oct 21, 2014 at 3:42 PM, Andy Lutomirski <luto at amacapital.net> wrote:
>>> On Tue, Oct 21, 2014 at 3:33 PM, Aditya Kali <adityakali at google.com> wrote:
>>>> On Tue, Oct 21, 2014 at 12:02 PM, Andy Lutomirski <luto at amacapital.net> wrote:
>>>>> On Tue, Oct 21, 2014 at 11:49 AM, Aditya Kali <adityakali at google.com> wrote:
>>>>>> On Mon, Oct 20, 2014 at 10:49 PM, Andy Lutomirski <luto at amacapital.net> wrote:
>>>>>>> On Mon, Oct 20, 2014 at 10:42 PM, Eric W. Biederman
>>>>>>> <ebiederm at xmission.com> wrote:
>>>>>>>> I do wonder if we think of this as chcgrouproot if there is a simpler
>>>>>>>> implementation.
>>>>>>> Could be.  I'll defer to Aditya for that one.
>>>>>> More than chcgrouproot, its probably closer to pivot_cgroup_root. In
>>>>>> addition to restricting the process to a cgroup-root, new processes
>>>>>> entering the container should also be implicitly contained within the
>>>>>> cgroup-root of that container.
>>>>> Why?  Concretely, why should this be in the kernel namespace code
>>>>> instead of in userspace?
>>>> Userspace can do it too. Though then there will be possibility of
>>>> having processes in the same mount namespace with different
>>>> cgroup-roots. Deriving contents of /proc/<pid>/cgroup becomes even
>>>> more complex. Thats another reason why it might not be good idea to
>>>> tie cgroups with mount namespace.
>>>>>> Implementing pivot_cgroup_root would
>>>>>> probably involve overloading mount-namespace to now understand cgroup
>>>>>> filesystem too. I did attempt combining cgroupns-root with mntns
>>>>>> earlier (not via a new syscall though), but came to the conclusion
>>>>>> that its just simpler to have a separate cgroup namespace and get
>>>>>> clear semantics. One of the issues was that implicitly changing cgroup
>>>>>> on setns to mntns seemed like a huge undesirable side-effect.
>>>>>> About pinning: I really feel that it should be OK to pin processes
>>>>>> within cgroupns-root. I think thats one of the most important feature
>>>>>> of cgroup-namespace since its most common usecase is to containerize
>>>>>> un-trusted processes - processes that, for their entire lifetime, need
>>>>>> to remain inside their container.
>>>>> So don't let them out.  None of the other namespaces have this kind of
>>>>> constraint:
>>>>>  - If you're in a mntns, you can still use fds from outside.
>>>>>  - If you're in a netns, you can still use sockets from outside the namespace.
>>>>>  - If you're in an ipcns, you can still use ipc handles from outside.
>>>> But none of the namespaces allow you to allocate new fds/sockets/ipc
>>>> handles in the outside namespace. I think moving a process outside of
>>>> cgroupns-root is like allocating a resource outside of your namespace.
>>> In a pidns, you can see outside tasks if you have an outside procfs
>>> mounted, but, if you don't, then you can't.  Wouldn't cgroupns be just
>>> like that?  You wouldn't be able to escape your cgroup as long as you
>>> don't have an inappropriate cgroupfs mounted.
>> I am not if we should only depend on restricted visibility for this
>> though. More details below.
>>>>>> And with explicit permission from
>>>>>> cgroup subsystem (something like cgroup.may_unshare as you had
>>>>>> suggested previously), we can make sure that unprivileged processes
>>>>>> cannot pin themselves. Also, maintaining this invariant (your current
>>>>>> cgroup is always under your cgroupns-root) keeps the code and the
>>>>>> semantics simple.
>>>>> I actually think it makes the semantics more complex.  The less policy
>>>>> you stick in the kernel, the easier it is to understand the impact of
>>>>> that policy.
>>>> My inclination is towards keeping things simpler - both in code as
>>>> well as in configuration. I agree that cgroupns might seem
>>>> "less-flexible", but in its current form, it encourages consistent
>>>> container configuration. If you have a process that needs to move
>>>> around between cgroups belonging to different containers, then that
>>>> process should probably not be inside any container's cgroup
>>>> namespace. Allowing that will just make the cgroup namespace
>>>> pretty-much meaningless.
>>> The problem with pinning is that preventing it causes problems
>>> (specifically, either something potentially complex and incompatible
>>> needs to be added or unprivileged processes will be able to pin
>>> themselves).
>>> Unless I'm missing something, a normal cgroupns user doesn't actually
>>> need kernel pinning support to effectively constrain its members'
>>> cgroups.
>> So there are 2 scenarios to consider:
>> We have 2 containers with cgroups: /container1 and /container2
>> Assume process P is running under cgroupns-root '/container1'
>> (1) process P wants to 'write' to cgroup.procs outside its
>> cgroupns-root (say to /container2/cgroup.procs)
> This, at least, doesn't have the problem with unprivileged processes
> pinning themselves.
>> (2) An admin process running in init_cgroup_ns (or any parent cgroupns
>> with cgroupns-root above /container1) wants to write pid of process P
>> to /container2/cgroup.procs (which lies outside of P's cgroupns-root)
>> For (1), I think its ok to reject such a write. This is consistent
>> with the restriction in cgroup_file_write added in 'Patch 6' of this
>> set. I believe this should be independent of visibility of the cgroup
>> hierarchy for P.
>> For (2), we may allow the write to succeed if we make sure that the
>> process doing the write is an admin process (with CAP_SYS_ADMIN in its
>> userns AND over P's cgroupns->user_ns).
> Why is its userns relevant?
> Why not just check whether the target cgroup is in the process doing
> the write's cgroupns? (NB: you need to check f_cred, here, not
> current_cred(), but that's orthogonal.)  Then the policy becomes: no
> user of cgroupfs can move any process outside of the cgroupfs's user's
> cgroupns root.
Humm .. it doesn't have to be. I think its simpler to not enforce
artificial permission checks unless there is a security concern (and
in this case, there doesn't seem to be any). So I will leave the
capability check out from here.

> I think I'm okay with this.
>> If this write succeeds, then:
>> (a) process P's /proc/<pid>/cgroup does not show anything when viewed
>> by 'self' or any other process in P's cgrgroupns. I would really like
>> to avoid showing relative paths or paths outside the cgroupns-root
> The empty string seems just as problematic to me.

Actually, there is no right answer here. Our options are:
* show relative path
-- this will break userspace as /proc/<pid>/cgroup does not show
relative paths today. This is also very ambiguous (is it relative to
cgroupns-root or relative to /proc/<pid>cgroup file reader's cgroup?).

* show absolute path
-- this will also wrong as the process won't be able to make sense of
it unless it has exposure to the global cgroup hierarchy.
-- worse case is this that the global path also exists under the
cgroupns-root ... so now the process thinks its in completely wrong
-- this exposes system

* show only "/"
-- this is arguably better, but if the process tires to verify that
its pid is in cgroup.procs of the cgroupns-root, its in for a

In either case, whatever we expose, the userspace won't be able to use
this path correctly (worse yet, it associates wrong cgroup for that
path). So I think its best to not print out the line for default
hierarchy at all. This happens today when cgroupfs is not mounted. I
am open to other suggestions.

>> (b) if process P does 'mount -t cgroup cgroup <mnt>', it will still be
>> only able to mount and see cgroup hierarchy under its cgroupns-root
>> (d) if process P tries to write to any cgroup file outside of its
>> cgroupns-root (assuming that hierarchy is visible to it for whatever
>> reason), it will fail as in (1)
> I'm still unconvinced that this serves any purpose.  If you give
> DAC/MAC permission to a task to write to something, and you give it
> access to an fd or mount pointing there, and you don't want it writing
> there, then *don't do that*.  I'm not really seeing why cgroupns needs
> special treatment here.

There was a suggestion on the previous version of this patch-set that
we need to prevent processes inside cgroupns to not be able to modify
settings of cgroups outside of its cgroupns-root. But I agree with
your point that cgroupns should not enforce unnecessary access-control
restrictions. Its job is only to virtualize the view of
/proc/<pid>/cgroup file as much as possible (100% virtualized for a
correctly setup container). This will get rid of most of patch 6/8
"cgroup: restrict cgroup operations within task's cgroupns" of this
series. The only check we keep is in cgroup_attach_task() which
ensures that target-cgroup is descendant of current's cgroupns-root
and prevents processes from escaping their cgroupns on their own.

>> i.e., in summary, you can't escape out of cgroupns-root yourself. You
>> will need help from an admin process running under some parent
>> cgroupns-root to move you out. Is that workable for your usecase? Most
>> of the things above already happen with the current patch-set, so it
>> should be easy to enable this.
>> Though there are still some open issues like:
>> * what happens if you move all the processes out of /container1 and
>> then 'rmdir /container1'? As it is now, you won't be able to setns()
>> to that cgroupns anymore. But the cgroupns will still hang around
>> until the processes switch their cgroupns.
> Seems okay.
>> * should we then also allow setns() without first entering the
>> cgroupns-root? setns also checks the same conditions as in (a) plus it
>> checks that your current cgroup is descendant of target cgroupns-root.
>> Alternatively we can special-case setns() to own cgroupns so that it
>> doesn't fail.
> I think setns should completely ignore the caller's cgroup and should
> not change it.  Userspace can do this.

All above changes more or less means that tasks cannot pin themselves
by unsharing cgroupns. Do you agree that we don't need that "explicit
permission from cgroupfs" anymore (via cgroup.may_unshare file or
other mechanism)?

>> * migration for these processes will be tricky, if not impossible. But
>> the admin trying to do this probably doesn't care about it or will
>> provision for it.
> Migration for processes in a mntns that have a current directory
> outside their mntns is also difficult or impossible.  Same with
> pidnses with an fd pointing at /proc/self from an outside-the-pid-ns
> procfs.  Nothing new here.
> --Andy

Thanks for the review!


More information about the Containers mailing list