user namespace - introduction

Serge E. Hallyn serue at us.ibm.com
Mon Jun 4 12:39:57 PDT 2007


[ I've been sitting on this for some months, and am just dumping it so
people can talk if they like, maybe even build on the patchset by
adding support for more filesystems or implementing the keyring.  Or
tell me how much the approach sucks. ]



First, I point out once more that the base user namespace patchset
Cedric originally sent out really is sufficient.  We just need users
to have separate quotas, limits, and in-kernel key storage.  Signal
delivery, file controls, etc. can be set up using pidspaces and separate
mount trees, using SELinux policy or other LSMs, and even using
ecryptfs.

But if people insist that uid checks be enhanced, here is
a new patchset which just might satisfy everyone, and which is based
on user namespace discussions from the last year, particularly
comments by Eric Biederman and David Howells.

Below is how I think the user namespace controls would work.  The
patches that follow only touch on parts of steps 1-4.

  1. let filesystems tag inodes and superblocks with one user namespace
  2. let generic_permission - and through inode->i_op->permission, the
     fs, if it wants to be smarter - enforce user namespaces
  3. by default, inode->i_userns comes from sb->s_userns, just as is
     done in these patches.
  4. By default, if inode->i_userns != task->userns, the process gets
     treated as 'nobody'.  This is a change from my current patches and
     what is done in -lxc, where all permission is denied.  I think it
     is a far preferable behavior.  It allows read-only bind mount
     sharing among user namespaces without a silly MS_USER_NS flag.
  5. Capabilities relating to actions on subjects or objects associated
     with a user namespace are only effective for targets in the same
     user namespace as the actor.
     This *could* be changed to also work for targets in descendant
     user namespaces, but that could slow things down.
  6. Create a new keychain for user namespaces.  Two types of entries.
     The first type of entry, (user_ns 5, uid 501) means that whichever
     user has that key will be recognized in user namespace 5 as uid
     501.  Presumably, uid 501 in user_ns 5 would have started a vserver
     with a new user namespace, say user_ns 7.  He would likely want to
     give uid 0 in user_ns 7 a (user_ns 5, uid 501) key.
     The second type of key, (user_ns 5, CAP_FOWNER) gives the user
     holding the key the ability to have CAP_FOWNER in userns 5.  By
     default, uid 0 in userns 7 cannot have CAP_FOWNER in userns 5.
     (Only) a task with (userns 5, CAP_SETPCAP) can give that key to
     any user in userns 7.  The key by itself does not grant the
     capability, but allows a task with that uid which has CAP_FOWNER
     in its P set to assert it for userns 5.

  7. Eventually filesystems could begin storing global uids in inode
     xattrs on disk, and use these in inode->i_op->permission() along
     with data in the user's userns key to do global uid permission
     checking.  Really this should be almost trivial to implement once
     the above has been implemented.  It could be done right in ext234
     etc, or in a small stackable fs.
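
To make steps 3 and 4 concrete, here is a rough user-space sketch of
the check.  It is not taken from the patches: the *_model types, the
MAY_*_M flags and model_permission() are all made up for illustration,
and it ignores supplementary groups, capabilities and ACLs.  It only
shows the inode falling back to the superblock's namespace and a task
from a foreign namespace being limited to the 'other' mode bits.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical user-space model; none of these are the kernel's types. */
struct user_ns { int id; };

struct super_block_model { const struct user_ns *s_userns; };

struct inode_model {
    const struct user_ns *i_userns;        /* NULL: inherit from sb (step 3) */
    const struct super_block_model *i_sb;
    unsigned int i_uid;
    unsigned int i_gid;
    unsigned int i_mode;                   /* low nine bits, rwxrwxrwx */
};

struct task_model {
    const struct user_ns *userns;
    unsigned int fsuid;
    unsigned int fsgid;
};

enum { MAY_EXEC_M = 1, MAY_WRITE_M = 2, MAY_READ_M = 4 };

/* Step 3: by default the inode's namespace is the superblock's. */
static const struct user_ns *inode_userns(const struct inode_model *inode)
{
    assert(inode != NULL && inode->i_sb != NULL);
    return inode->i_userns ? inode->i_userns : inode->i_sb->s_userns;
}

/* Returns true if every access bit requested in mask is granted. */
static bool model_permission(const struct task_model *task,
                             const struct inode_model *inode,
                             unsigned int mask)
{
    unsigned int mode = inode->i_mode;

    if (inode_userns(inode) != task->userns)
        mode &= 7u;        /* step 4: treated as 'nobody', 'other' bits only */
    else if (task->fsuid == inode->i_uid)
        mode >>= 6;        /* owner bits */
    else if (task->fsgid == inode->i_gid)
        mode >>= 3;        /* group bits */

    return (mode & 7u & mask) == mask;
}
```

With a 0644 file owned by uid 501 in one namespace, a task with uid 501
in a different namespace can read but not write it, which is the
read-only sharing behavior step 4 is after.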

-serge

Note: step 1 has been complained about because some think it should be done
at the vfsmount level.  If you read through the whole set of steps I
think you'll see why it is not more limited.  The fs gets to decide the
real owner of a file, and despite there being one real owner, any number
of users can be made to be treated as the owner, so there is no
limitation in this approach.
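
The way several users can end up treated as the owner is through the
keys in step 6, combined with the capability rule in step 5.  A rough
user-space sketch follows.  Everything in it is an assumption made for
illustration: the struct layouts, the function names, the flat key
array standing in for a keyring, and 65534 as the 'nobody' uid.
CAP_FOWNER_M is 3 only to mirror the Linux CAP_FOWNER number, and is
used here just as a bit index.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum key_type { KEY_UID, KEY_CAP };

/* One keychain entry: (user_ns, uid) or (user_ns, capability). */
struct userns_key {
    enum key_type type;
    int user_ns;           /* namespace the key applies to */
    unsigned int value;    /* uid for KEY_UID, capability bit for KEY_CAP */
};

struct keyed_task {
    int user_ns;                   /* namespace the task runs in */
    unsigned int uid;
    unsigned long cap_permitted;   /* bitmask standing in for the P set */
    const struct userns_key *keys;
    size_t nkeys;
};

#define NOBODY_UID   65534u        /* assumed value for 'nobody' */
#define CAP_FOWNER_M 3u

/* Which uid is this task recognized as in target_ns? */
static unsigned int uid_in_ns(const struct keyed_task *t, int target_ns)
{
    assert(t != NULL);
    if (t->user_ns == target_ns)
        return t->uid;
    for (size_t i = 0; i < t->nkeys; i++)
        if (t->keys[i].type == KEY_UID && t->keys[i].user_ns == target_ns)
            return t->keys[i].value;
    return NOBODY_UID;             /* no key: treated as 'nobody' */
}

/* May this task assert capability cap against targets in target_ns? */
static bool cap_in_ns(const struct keyed_task *t, int target_ns,
                      unsigned int cap)
{
    assert(t != NULL);
    /* The key alone grants nothing; cap must also be in the P set. */
    if (!((t->cap_permitted >> cap) & 1ul))
        return false;
    if (t->user_ns == target_ns)   /* step 5: same namespace only */
        return true;
    for (size_t i = 0; i < t->nkeys; i++)
        if (t->keys[i].type == KEY_CAP &&
            t->keys[i].user_ns == target_ns &&
            t->keys[i].value == cap)
            return true;
    return false;
}
```

So uid 0 in user_ns 7 holding a (user_ns 5, uid 501) key is seen as
uid 501 in user_ns 5, and as 'nobody' in any namespace it holds no key
for.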

