[PATCHv1 7/8] cgroup: cgroup namespace setns support

Wed Oct 22 18:37:55 UTC 2014

On Tue, Oct 21, 2014 at 5:58 PM, Andy Lutomirski <luto at amacapital.net> wrote:
> On Tue, Oct 21, 2014 at 5:46 PM, Aditya Kali <adityakali at google.com> wrote:
>> On Tue, Oct 21, 2014 at 3:42 PM, Andy Lutomirski <luto at amacapital.net> wrote:
>>> On Tue, Oct 21, 2014 at 3:33 PM, Aditya Kali <adityakali at google.com> wrote:
>>>> On Tue, Oct 21, 2014 at 12:02 PM, Andy Lutomirski <luto at amacapital.net> wrote:
>>>>> On Tue, Oct 21, 2014 at 11:49 AM, Aditya Kali <adityakali at google.com> wrote:
>>>>>> On Mon, Oct 20, 2014 at 10:49 PM, Andy Lutomirski <luto at amacapital.net> wrote:
>>>>>>> On Mon, Oct 20, 2014 at 10:42 PM, Eric W. Biederman
>>>>>>> <ebiederm at xmission.com> wrote:
>>>>>>>>
>>>>>>>> I do wonder if we think of this as chcgrouproot if there is a simpler
>>>>>>>> implementation.
>>>>>>>
>>>>>>> Could be.  I'll defer to Aditya for that one.
>>>>>>>
>>>>>>
>>>>>> More than chcgrouproot, its probably closer to pivot_cgroup_root. In
>>>>>> addition to restricting the process to a cgroup-root, new processes
>>>>>> entering the container should also be implicitly contained within the
>>>>>> cgroup-root of that container.
>>>>>
>>>>> Why?  Concretely, why should this be in the kernel namespace code
>>>>> instead of in userspace?
>>>>>
>>>>
>>>> Userspace can do it too. Though then there will be possibility of
>>>> having processes in the same mount namespace with different
>>>> cgroup-roots. Deriving contents of /proc/<pid>/cgroup becomes even
>>>> more complex. Thats another reason why it might not be good idea to
>>>> tie cgroups with mount namespace.
>>>>
>>>>>> Implementing pivot_cgroup_root would
>>>>>> probably involve overloading mount-namespace to now understand cgroup
>>>>>> filesystem too. I did attempt combining cgroupns-root with mntns
>>>>>> earlier (not via a new syscall though), but came to the conclusion
>>>>>> that its just simpler to have a separate cgroup namespace and get
>>>>>> clear semantics. One of the issues was that implicitly changing cgroup
>>>>>> on setns to mntns seemed like a huge undesirable side-effect.
>>>>>>
>>>>>> About pinning: I really feel that it should be OK to pin processes
>>>>>> within cgroupns-root. I think thats one of the most important feature
>>>>>> of cgroup-namespace since its most common usecase is to containerize
>>>>>> un-trusted processes - processes that, for their entire lifetime, need
>>>>>> to remain inside their container.
>>>>>
>>>>> So don't let them out.  None of the other namespaces have this kind of
>>>>> constraint:
>>>>>
>>>>>  - If you're in a mntns, you can still use fds from outside.
>>>>>  - If you're in a netns, you can still use sockets from outside the namespace.
>>>>>  - If you're in an ipcns, you can still use ipc handles from outside.
>>>>
>>>> But none of the namespaces allow you to allocate new fds/sockets/ipc
>>>> handles in the outside namespace. I think moving a process outside of
>>>> cgroupns-root is like allocating a resource outside of your namespace.
>>>
>>> In a pidns, you can see outside tasks if you have an outside procfs
>>> mounted, but, if you don't, then you can't.  Wouldn't cgroupns be just
>>> like that?  You wouldn't be able to escape your cgroup as long as you
>>> don't have an inappropriate cgroupfs mounted.
>>>
>>
>> I am not if we should only depend on restricted visibility for this
>> though. More details below.
>>
>>>
>>>>>
>>>>>> And with explicit permission from
>>>>>> cgroup subsystem (something like cgroup.may_unshare as you had
>>>>>> suggested previously), we can make sure that unprivileged processes
>>>>>> cannot pin themselves. Also, maintaining this invariant (your current
>>>>>> cgroup is always under your cgroupns-root) keeps the code and the
>>>>>> semantics simple.
>>>>>
>>>>> I actually think it makes the semantics more complex.  The less policy
>>>>> you stick in the kernel, the easier it is to understand the impact of
>>>>> that policy.
>>>>>
>>>>
>>>> My inclination is towards keeping things simpler - both in code as
>>>> well as in configuration. I agree that cgroupns might seem
>>>> "less-flexible", but in its current form, it encourages consistent
>>>> container configuration. If you have a process that needs to move
>>>> around between cgroups belonging to different containers, then that
>>>> process should probably not be inside any container's cgroup
>>>> namespace. Allowing that will just make the cgroup namespace
>>>> pretty-much meaningless.
>>>
>>> The problem with pinning is that preventing it causes problems
>>> (specifically, either something potentially complex and incompatible
>>> needs to be added or unprivileged processes will be able to pin
>>> themselves).
>>>
>>> Unless I'm missing something, a normal cgroupns user doesn't actually
>>> need kernel pinning support to effectively constrain its members'
>>> cgroups.
>>>
>>
>> So there are 2 scenarios to consider:
>>
>> We have 2 containers with cgroups: /container1 and /container2
>> Assume process P is running under cgroupns-root '/container1'
>>
>> (1) process P wants to 'write' to cgroup.procs outside its
>> cgroupns-root (say to /container2/cgroup.procs)
>
> This, at least, doesn't have the problem with unprivileged processes
> pinning themselves.
>
>> (2) An admin process running in init_cgroup_ns (or any parent cgroupns
>> with cgroupns-root above /container1) wants to write pid of process P
>> to /container2/cgroup.procs (which lies outside of P's cgroupns-root)
>>
>> For (1), I think its ok to reject such a write. This is consistent
>> with the restriction in cgroup_file_write added in 'Patch 6' of this
>> set. I believe this should be independent of visibility of the cgroup
>> hierarchy for P.
>>
>> For (2), we may allow the write to succeed if we make sure that the
>> process doing the write is an admin process (with CAP_SYS_ADMIN in its
>> userns AND over P's cgroupns->user_ns).
>
> Why is its userns relevant?
>
> Why not just check whether the target cgroup is in the process doing
> the write's cgroupns? (NB: you need to check f_cred, here, not
> current_cred(), but that's orthogonal.)  Then the policy becomes: no
> user of cgroupfs can move any process outside of the cgroupfs's user's
> cgroupns root.
>
Humm .. it doesn't have to be. I think its simpler to not enforce
artificial permission checks unless there is a security concern (and
in this case, there doesn't seem to be any). So I will leave the
capability check out from here.

> I think I'm okay with this.
>
>> If this write succeeds, then:
>> (a) process P's /proc/<pid>/cgroup does not show anything when viewed
>> by 'self' or any other process in P's cgrgroupns. I would really like
>> to avoid showing relative paths or paths outside the cgroupns-root
>
> The empty string seems just as problematic to me.

Actually, there is no right answer here. Our options are:
* show relative path
-- this will break userspace as /proc/<pid>/cgroup does not show
relative paths today. This is also very ambiguous (is it relative to
cgroupns-root or relative to /proc/<pid>cgroup file reader's cgroup?).

* show absolute path
-- this will also wrong as the process won't be able to make sense of
it unless it has exposure to the global cgroup hierarchy.
-- worse case is this that the global path also exists under the
cgroupns-root ... so now the process thinks its in completely wrong
cgroup
-- this exposes system

* show only "/"
-- this is arguably better, but if the process tires to verify that
its pid is in cgroup.procs of the cgroupns-root, its in for a
surprise!

In either case, whatever we expose, the userspace won't be able to use
this path correctly (worse yet, it associates wrong cgroup for that
path). So I think its best to not print out the line for default
hierarchy at all. This happens today when cgroupfs is not mounted. I
am open to other suggestions.

>
>> (b) if process P does 'mount -t cgroup cgroup <mnt>', it will still be
>> only able to mount and see cgroup hierarchy under its cgroupns-root
>> (d) if process P tries to write to any cgroup file outside of its
>> cgroupns-root (assuming that hierarchy is visible to it for whatever
>> reason), it will fail as in (1)
>
> I'm still unconvinced that this serves any purpose.  If you give
> DAC/MAC permission to a task to write to something, and you give it
> access to an fd or mount pointing there, and you don't want it writing
> there, then *don't do that*.  I'm not really seeing why cgroupns needs
> special treatment here.
>

There was a suggestion on the previous version of this patch-set that
we need to prevent processes inside cgroupns to not be able to modify
settings of cgroups outside of its cgroupns-root. But I agree with
your point that cgroupns should not enforce unnecessary access-control
restrictions. Its job is only to virtualize the view of
/proc/<pid>/cgroup file as much as possible (100% virtualized for a
correctly setup container). This will get rid of most of patch 6/8
"cgroup: restrict cgroup operations within task's cgroupns" of this
series. The only check we keep is in cgroup_attach_task() which
ensures that target-cgroup is descendant of current's cgroupns-root
and prevents processes from escaping their cgroupns on their own.

>>
>> i.e., in summary, you can't escape out of cgroupns-root yourself. You
>> will need help from an admin process running under some parent
>> cgroupns-root to move you out. Is that workable for your usecase? Most
>> of the things above already happen with the current patch-set, so it
>> should be easy to enable this.
>>
>> Though there are still some open issues like:
>> * what happens if you move all the processes out of /container1 and
>> then 'rmdir /container1'? As it is now, you won't be able to setns()
>> to that cgroupns anymore. But the cgroupns will still hang around
>> until the processes switch their cgroupns.
>
> Seems okay.
>
>> * should we then also allow setns() without first entering the
>> cgroupns-root? setns also checks the same conditions as in (a) plus it
>> checks that your current cgroup is descendant of target cgroupns-root.
>> Alternatively we can special-case setns() to own cgroupns so that it
>> doesn't fail.
>
> I think setns should completely ignore the caller's cgroup and should
> not change it.  Userspace can do this.
>

All above changes more or less means that tasks cannot pin themselves
by unsharing cgroupns. Do you agree that we don't need that "explicit
permission from cgroupfs" anymore (via cgroup.may_unshare file or
other mechanism)?

>> * migration for these processes will be tricky, if not impossible. But
>> the admin trying to do this probably doesn't care about it or will
>> provision for it.
>
> Migration for processes in a mntns that have a current directory
> outside their mntns is also difficult or impossible.  Same with
> pidnses with an fd pointing at /proc/self from an outside-the-pid-ns
> procfs.  Nothing new here.
>
> --Andy

Thanks for the review!

-- 
Aditya