Why does devices cgroup check for CAP_SYS_ADMIN explicitly?
Serge Hallyn
serge.hallyn at canonical.com
Tue Nov 6 15:45:34 UTC 2012
Quoting Eric W. Biederman (ebiederm at xmission.com):
> Tejun Heo <tj at kernel.org> writes:
>
> > Hello, Serge.
> >
> > On Tue, Nov 06, 2012 at 09:01:32AM -0600, Serge Hallyn wrote:
> >> More practically, lacking user namespaces you can create a full (i.e.
> >> ubuntu) container that doesn't have CAP_SYS_ADMIN, but not one without
> >> root. So this allows you to prevent containers from bypassing devices
> >> cgroup restrictions set by the parent. (In reality we are not using
> >> that in ubuntu - we grant CAP_SYS_ADMIN and use apparmor to restrict -
> >> but other distros do.)
> >
> > Do you even mount cgroupfs in containers? If you just bind-mount
> > cgroupfs verbatim in containers, I don't think that's gonna work very
> > well. If not, all this doesn't make any difference for containers.
> >
> > So, you don't really have any actual use case for the explicit CAP_*
> > checks, right?
>
> Having thought about this a little more I can give a definitive answer.
>
> Adding a process to the device control group is equivalent to calling
> mknod, as it allows that process to open device nodes, or equivalently
> not open device nodes. Therefore a capable check is absolutely
> required.
>
> Without a capability check it would be possible to remove access to
> /dev/console for a suid root application keeping it from reporting
> attempts to hack it for example.
>
> update_access can allow access to previously unaccessible devices
> and so is equivalent to mknod and as such requires a capability call.
>
> static int devcgroup_update_access(struct dev_cgroup *devcgroup,
> int filetype, const char *buffer)
> {
> ....
> if (!capable(CAP_SYS_ADMIN))
> return -EPERM;
>
>
> Likewise move to a different cgroup can give you a completely different
> set of devices you can use. And is roughly equivalent to mknod, and
> needs a capability call.
>
> static int devcgroup_can_attach(struct cgroup *new_cgrp,
> struct cgroup_taskset *set)
> {
> struct task_struct *task = cgroup_taskset_first(set);
>
> if (current != task && !capable(CAP_SYS_ADMIN))
> return -EPERM;
>
>
> The generic cgroup check in attach_task_by_pid to see if you can move
> another process into a cgroup needs to be a capability call and not a
> test for uid == 0.
>
> static int attach_task_by_pid(struct cgroup *cgrp, u64 pid, bool threadgroup)
> {
> if (pid) {
> tsk = find_task_by_vpid(pid);
>
> /*
> * even if we're attaching all tasks in the thread group, we
> * only need to check permissions on one of them.
> */
> tcred = __task_cred(tsk);
> if (!uid_eq(cred->euid, GLOBAL_ROOT_UID) &&
> ^^^^^^^^^^^^^^^
> !uid_eq(cred->euid, tcred->uid) &&
> !uid_eq(cred->euid, tcred->suid)) {
> rcu_read_unlock();
> ret = -EACCES;
> goto out_unlock_cgroup;
>
> Eric
(full context kept, though long, bc it's all important)
Note that part of the problem is simply that the devices cgroup is serving
as a stand-in for the lack of both user and device namespaces. If those
both existed, we could get rid of the devices cgroup. Likewise, the
presence of the devices cgroup makes a device namespace far less
compelling :) We can play games with bind mounts into /dev and devcgroup
to do most of what we want a devicens for.
-serge
More information about the Containers
mailing list