[0/10] User namespaces: introduction

Serge E. Hallyn serue at us.ibm.com
Fri Aug 22 18:17:31 PDT 2008

Quoting Eric W. Biederman (ebiederm at xmission.com):
> "Serge E. Hallyn" <serue at us.ibm.com> writes:
> > Hi Eric,
> >
> > so here is a start to a userns patchset trying to follow your ideas
> > about how to have user namespaces and filesystems interact.  Ignore
> > the bookkeeping crap or you'll pull your hair out.  Lots of stuff
> > remains unimplemented - e.g. chown (setattr) and proper handling of
> > capabilities.  But you can do some fun things with this patchset,
> > e.g.:
> >
> > 	(log in as root)
> > 	setcap cap_sys_admin=ep ns_exec
> > 	setcap cap_sys_admin=ep usernsmount
> > 	ns_exec -U /bin/sh
> > 	ls /root (fails)
> > 	ls / (succeeds)
> > 	(log in as hallyn)
> > 	ns_exec -U /bin/sh
> > 	id
> > 		(uid=0, gid=0)
> > 	ls (fails, can't descend /home/hallyn)
> > 	usernsmount / nsid=4
> > 	ls (succeeds)
> > 	touch ab
> > 	ls -l ab
> > 		(ab is owned by root)
> > 	exit
> > 	(we're logged in as hallyn in the init_user_ns again)
> > 	ls -l ab
> > 		(ab is owned by hallyn)
> >
> > The only supported fs is ext3.  Only a few operations are supported.
> > So in the example above, where we are hallyn in the init_user_ns but
> > root in the child user ns:
> > 	when we create a file, it is properly handled, so
> > 		inode->i_uid=500, but an xattr (nsid=4,uid=0) is added
> > 	when we chown the file to root, it is not properly handled,
> > 		so inode->i_uid = 0
> > it's just a matter of hooking all the places at this point.
> >
> > Capabilities remain a problem.  Right now I think capabilities will
> > need to be split up into system-wide caps, and container-safe caps.
> > So CAP_NET_ADMIN, CAP_NET_RAW, CAP_DAC_OVERRIDE, those are container-safe.
> > CAP_SYS_REBOOT may become container-safe one day, but for now is very
> > much system-wide.
> >
> > So if I'm uid 500 on the host and create a user namespace where I'm
> > uid=0, I should be able to acquire container-safe caps (perhaps
> > contingent on whether I unshared all other namespaces), but not
> > system-wide ones.  Or, whether I can acquire them would depend
> > on whether the suid bit was set in a user_ns or not.  sigh.
>
> Serge, at first glance this looks like a good start, especially for thinking
> through how things will work.
>
> It has just occurred to me that from a dependency point of view it
> makes an enormous amount of sense to sort out capable with
> respect to namespaces before we get to the filesystems.
>
> There is no one else working in the area of capabilities so there won't
> be conflicts, and we need a firm understanding of how capabilities are
> going to work with respect to namespaces before we start embedding
> the logic in filesystems.
>
> With respect to your separation of capabilities in namespaces I don't think
> you have quite grasped the simple idea that is sitting in my head and makes
> all of this clear.  Let me see if I can explain it better.
>
> A fully qualified capability name would be of the form:
>
> 	userns:capability_name
>
> For each operation we will check for one specific capability.
> For the network namespace in particular we will check for:
>
> 	userns_of_network_namespace_creator:CAP_NET_ADMIN
>
> The check for a capability will succeed if:
> - We have the exact fully qualified capability.
> - We are outside the user namespace but are the owner of
>   the user namespace.
> - We are outside the user namespace but have the appropriate
>   capability over the owner of the user namespace (CAP_SYS_PTRACE?).
>   This last test would recurse.
>
> I'm less certain than I would like about which permissions we allow someone
> outside of a container to possess and still control the container.
>
> This has two very useful implications.
> - We can have all capabilities in a new user namespace and be completely
>   impotent.
> - Allowing the capabilities of a user namespace to do something useful
>   can come gradually.
>
> Which means we need to extend the classic capable check to become
> capable(userns, capability).  Or possibly we extend the capability
> parameter to be a structure that can hold both userns and the capability,
> whichever turns out to be more maintainable.
>
> Once we have done that we can allow something to be under the power
> of creator_user_ns:capability instead of init_user_ns:capability.
>
> So the CAP_SYS_REBOOT test will be init_user_ns:CAP_SYS_REBOOT for the
> foreseeable future, while the CAP_NET_ADMIN test will shortly
> become creator_of_netns:CAP_NET_ADMIN.
>
> Of course none of that will happen until we relax the test to create a
> new namespace from init_user_ns:CAP_SYS_ADMIN to
> current_user_ns:CAP_SYS_ADMIN.
>
> Eric
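
The three success conditions Eric lists for a userns:capability check can be
sketched roughly as below.  This is a userspace model only, not kernel code
and not anything from the patchset: the demo_* types, the bitmask of caps and
the function name are all made up for illustration.  In particular, the third
rule is modelled here as "retry the same capability against the parent
namespace", which is an assumption, since which capability applies "over the
owner" is left open above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical capability bits; not the kernel's numbering. */
enum {
	DEMO_CAP_NET_ADMIN  = 1u << 0,
	DEMO_CAP_SYS_REBOOT = 1u << 1
};

/* Hypothetical stand-in for struct user_ns. */
struct demo_userns {
	const struct demo_userns *parent; /* NULL for the initial namespace */
	unsigned int owner_uid;           /* creator's uid, as seen in parent */
};

/* Hypothetical stand-in for a task's credentials. */
struct demo_task {
	const struct demo_userns *ns; /* namespace the task lives in */
	unsigned int uid;             /* uid inside that namespace */
	unsigned int caps;            /* caps held, qualified by t->ns */
};

/* Does task t hold target:cap under the three rules? */
bool demo_ns_capable(const struct demo_task *t,
		     const struct demo_userns *target, unsigned int cap)
{
	const struct demo_userns *ns;

	assert(t != NULL && t->ns != NULL);

	for (ns = target; ns != NULL; ns = ns->parent) {
		/* Rule 1: we hold the exact fully qualified capability. */
		if (t->ns == ns)
			return (t->caps & cap) != 0;

		/* Rule 2: we are outside the namespace but own it. */
		if (t->ns == ns->parent && t->uid == ns->owner_uid)
			return true;

		/* Rule 3 (assumed form): recurse by retrying one level up. */
	}
	return false;
}
```

With a child namespace created by uid 500, root inside the child holds
child:CAP_NET_ADMIN but fails any init_user_ns check, which is the "all
capabilities and completely impotent" case; uid 500 on the host passes as
owner, and host root passes through the recursion.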

It definitely seems to make sense in terms of the security
implications, and solving this before the filesystem handlers seems
to make sense too.  I would still like to get the first 3 patches upstream
pretty soon, though, as I believe they are proper fixes.

But wrt userns:capability, the problem that brings to mind is that of
referring to the userns.  Do we use the userspace-exported id, or do we
use the actual in-kernel user_ns?  If we use the in-kernel user_ns,
then we'd have to take a ref for each cap, yuck.  But you had wanted
only filesystems, via 'mount', to associate userspace ids with the
in-kernel struct user_ns, so that complicates the idea of having
capabilities refer to those.

Anyway I like the overall approach, and will think a bit about
any other actual implementation issues.
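
For reference, the file-ownership behaviour from the original posting
(inode->i_uid=500 plus an (nsid=4,uid=0) xattr, shown as root inside the
child namespace and as hallyn in the init_user_ns) can be modelled roughly
as below.  This is a userspace sketch, not the patchset's code: the demo_*
names and DEMO_INIT_NSID are hypothetical, and falling back to i_uid for
any namespace other than the one named in the xattr is an assumption, since
the posting only covers those two viewers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical id for the initial user namespace. */
#define DEMO_INIT_NSID 0u

/* Hypothetical model of the (nsid, uid) xattr described in the posting. */
struct demo_ns_owner {
	unsigned int nsid;
	unsigned int uid;
	bool present; /* false if the inode carries no such xattr */
};

/* Hypothetical stand-in for the ownership fields of an inode. */
struct demo_inode {
	unsigned int i_uid;         /* owner as stored on disk */
	struct demo_ns_owner xattr; /* per-namespace owner, if any */
};

/* Which uid would a viewer in namespace viewer_nsid see as the owner? */
unsigned int demo_owner_seen_from(const struct demo_inode *inode,
				  unsigned int viewer_nsid)
{
	assert(inode != NULL);

	/* Inside the namespace named by the xattr, the xattr uid wins. */
	if (inode->xattr.present && inode->xattr.nsid == viewer_nsid)
		return inode->xattr.uid;

	/* Assumed fallback: everyone else sees the on-disk i_uid. */
	return inode->i_uid;
}
```

For the file "ab" in the transcript, a viewer with nsid 4 gets uid 0 and a
viewer in the initial namespace gets uid 500.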

