[PATCH 10/10] sysfs: user namespaces: add ns to user_struct

Eric W. Biederman ebiederm at xmission.com
Wed Apr 30 15:13:53 PDT 2008


"Serge E. Hallyn" <serue at us.ibm.com> writes:

> Quoting Eric W. Biederman (ebiederm at xmission.com):
>> "Serge E. Hallyn" <serue at us.ibm.com> writes:
>> 
>> >> > Index: linux-mm/include/linux/sched.h
>> >> > ===================================================================
>> >> > --- linux-mm.orig/include/linux/sched.h
>> >> > +++ linux-mm/include/linux/sched.h
>> >> > @@ -598,7 +598,7 @@ struct user_struct {
>> >> >  
>> >> >  	/* Hash table maintenance information */
>> >> >  	struct hlist_node uidhash_node;
>> >> > -	uid_t uid;
>> >> > +	struct k_uid_t uid;
>> >> 
>> >> If we are going to go this direction my inclination
>> >> is to include an array of a single element in user_struct.
>> >> 
>> >> Maybe that makes sense.  I just know we need to talk about
>> >> how a user maps into different user namespaces.  As that
>> >
>> > My thought had been that a task belongs to several user_structs, but
>> > each user_struct belongs to just one user namespace.  Maybe as you
>> > suggest that's not the right way to go.
>> >
>> > But are you ok with just sticking a user_namespace * in here for now,
>> > and making it clear that the user_struct-user_namespace relation is yet
>> > to be defined?
>> >
>> > If not that's fine, we just won't be able to clone(CLONE_NEWUSER)
>> > until we get the relationship straightened out.
>> >
>> >> is a real concept that really occurs in real filesystems
>> >> like nfsv4 and p9fs, and having infrastructure that can
>> >> deal with the concept (even if it doesn't support it yet) would be
>> >> useful.
>> >
>> > I'll have to look at 9p, because right now I don't know what you're
>> > talking about.  Then I'll move to the containers list to discuss what
>> > the user_struct should look like.
>> 
>> Ok.  The concept present in nfsv4 and 9p is that a user is represented
>> by a username string instead of by a numerical id.  When nfsv4 encounters
>> a username for which it doesn't have a cached mapping to a uid, it calls
>> out to userspace to get that mapping.  9p does something similar although
>> I believe less general.
>> 
>> The key point here is that we have clear precedent of a mapping from one user
>> namespace to another in real world code.  In this case nfsv4 has one user
>> namespace (string based) and the systems that mount it have a separate
>> user namespace (uid based).
>> 
>> Once user namespaces are fleshed out I expect that same potential to
>> exist: each user namespace could have a different uid mapping for
>> the same username string on nfsv4.
>> 
>> From uid we currently map to a user_struct, at which point things get a
>> little odd.  I think we could swing either way: either keeping kernel
>> user namespaces completely disjoint or allowing them to be mapped to
>> each other.
>> 
>> I certainly like the classic NFS case of mapping uid 0 to user nobody
>> on a nonlocal filesystem (outside of the container in our case) so they
>> don't accidentally do something that root-only powers would otherwise
>> allow.
>> 
>> In general I think managing mapping tables between user namespaces is
>> a pain in the butt and something to be avoided if you have the option.
>> I do see a small place for it though.
>> 
>> Eric
>
> No sense talking about how to relate uids+namespaces to user_structs to
> task_structs without first laying out a few requirements.  Here is the
> list I would start with.  I'm being optimistic here that we can one day
> allow user namespaces to be unshared without privilege, and gearing the
> requirements to that (in fact the requirements facilitate that):
>
> ===========================
> Requirement:
> ===========================
> when uid 500 creates a new userns, then:
>     1. uid 500 in parentns must be able to kill tasks in the container.
>     2. uid 500 must be able to create, chown, change user_ns, delete
>         files belonging to the container.
>     3. tasks in a container should be able to get 'user nobody'
>         access to files from outside the container (/usr ro-remount)
>     4. uid 400 in the container created by uid 500 must not be able
>         to read files belonging to uid 400 in the parent userns
>     5. uid 400 in the container created by uid 500 must not be able
>         to signal tasks by uid 400 in parent user_ns (*1)
>     6. a privileged app in the container created by uid 500 must not
>         get privilege over tasks outside the container (*1)
>     7. a privileged app in the container created by uid 500 must not
>         get privilege over files outside the container (*2)

Sounds like a reasonable set of requirements.
>
> *1: this should be mostly impossible if we have CLONE_NEWUSER|CLONE_NEWPID
> *2: the feasibility of this depends entirely on what we do to tag the fs.
>
> Based on that I'd say that the fancier mapping of uids between
> containers really isn't necessary, and if needed it can always
> be emulated using e.g. nfsv4 to do the actual mapping of container
> uids to usernames known by the network fs.
>
> But we also need to decide what we're willing to do for the regular
> container filesystem.  That's where I keep getting stuck.

Then let's look at this a couple of different ways.
- Filesystems have an internal user namespace that they use
  for their permission checks, and they have some means
  of mapping that user namespace into the kernel's user namespace.
  Usually this is a one-to-one mapping, but occasionally,
  in cases like nfsv4 or fat, there is a more interesting
  mapping scheme going on.

  For filesystems on remote servers and removable media like
  usb keys this concept of the filesystem's user namespace
  seems relevant.

In practice this gives us two classes of filesystems we have
to work with.
- Multiple user namespace aware filesystems like nfsv4.
- Normal filesystems that do a one-to-one mapping between
  kernel uids and filesystem uids.
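
To make the distinction concrete, here is a minimal sketch of what a
per-filesystem mapping hook could look like.  Every name below is
hypothetical; nothing like this exists in the tree today.  The identity
mapping would be the default, and something like nfsv4 would supply its
own translation:

	/* Hypothetical sketch -- none of these names exist in the
	 * kernel.  A filesystem describes how the uids it stores map
	 * into the kernel's uids; most filesystems take the default. */
	struct fs_uid_map_ops {
		/* Map a uid as the filesystem stores it to a kernel uid. */
		uid_t (*to_kernel_uid)(struct super_block *sb, uid_t fs_uid);
		/* Map a kernel uid back to the filesystem's form. */
		uid_t (*from_kernel_uid)(struct super_block *sb, uid_t kuid);
	};

	/* The one-to-one default used by "normal" filesystems. */
	static uid_t identity_uid(struct super_block *sb, uid_t uid)
	{
		return uid;
	}

	static const struct fs_uid_map_ops identity_uid_map_ops = {
		.to_kernel_uid   = identity_uid,
		.from_kernel_uid = identity_uid,
	};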

> Do we tag each inode with a user_namespace based on some mount context?
Last time this was discussed the sane thing appeared to be tagging the
vfs_mount structure with the namespace of the mount, and having the
default permission operations reference back to the mount point.

For multiple-namespace-aware filesystems this seems especially useful,
as it does not pollute the pure filesystem structures like the
superblock and allows nfs to do all of its superblock merging tricks.
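
As a rough sketch of the tagging itself (the field and helper here are
invented for illustration, they are not in today's vfsmount), the mount
carries the namespace and the permission path reaches it through the
mount rather than the superblock:

	/* Hypothetical: the user namespace tag lives on the mount, not
	 * the superblock, so one superblock can appear in several
	 * namespaces at once. */
	struct vfsmount {
		/* ... existing fields ... */
		struct user_namespace *mnt_user_ns; /* ns this mount maps into */
	};

	static inline struct user_namespace *
	mnt_user_namespace(const struct vfsmount *mnt)
	{
		return mnt->mnt_user_ns;
	}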

> Do we tag some files with a persistent 'key' which uniquely identifies a
> user in all user namespaces (and across reboots)?

I think we leave that up to the filesystem or a stackable filesystem
that rides on top of the filesystem.  With a stackable filesystem
we should be able to easily implement the vserver trick of using
high bits of the uid to indicate which uid namespace we care about.
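
A loose sketch of that encoding, with illustrative numbers only (an
8-bit namespace tag in the top of a 32-bit uid; the actual split would
be a design decision):

	/* Sketch of the vserver-style split: high bits select the user
	 * namespace, low bits are the uid within it.  Widths are
	 * illustrative. */
	#define UIDNS_BITS	8
	#define UIDNS_SHIFT	(32 - UIDNS_BITS)
	#define UID_LOW_MASK	((1U << UIDNS_SHIFT) - 1)

	static inline uid_t xuid_encode(unsigned int ns_id, uid_t uid)
	{
		return ((uid_t)ns_id << UIDNS_SHIFT) | (uid & UID_LOW_MASK);
	}

	static inline unsigned int xuid_ns(uid_t xuid)
	{
		return xuid >> UIDNS_SHIFT;
	}

	static inline uid_t xuid_uid(uid_t xuid)
	{
		return xuid & UID_LOW_MASK;
	}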

> Do we implement a
> new, mostly pass-through stackable fs which we mount on top of an
> existing fs to do uid translation?

Probably, for supporting older filesystems.  Unless we choose to build
a more interesting translation layer like the one nfsv4 uses and just
upgrade classic unix filesystems to use that.

> Do we force the use of nfsv4? 
> Do we
> rely on an LSM like SELinux or smack to provide fs isolation between
> user namespaces?  Do we use a new LSM that just adds security.userns
> xattrs to all files to tag the userns?

> Heck, maybe nfsv4 is the way to go.
Let's make that our first target general solution: something that handles
multiple user namespaces at the filesystem level now.
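
For flavor, here is a very loose sketch of what that nfsv4-style layer
does: the filesystem speaks username strings, the kernel keeps a
per-namespace cache of string-to-uid mappings, and a cache miss triggers
an upcall to userspace.  All names below are hypothetical; the real nfs
idmapper differs in its details:

	struct uid_map_entry {
		struct hlist_node	node;
		struct user_namespace	*ns;	/* ns the mapping is for */
		char			name[64];
		uid_t			uid;
	};

	/* uid_map_lookup() and uid_map_upcall() are hypothetical
	 * helpers: a hash table search and a call_usermodehelper()-style
	 * request to userspace. */
	static int name_to_uid(struct user_namespace *ns, const char *name,
			       uid_t *uid)
	{
		struct uid_map_entry *e = uid_map_lookup(ns, name);

		if (!e) {
			/* Miss: ask userspace to resolve the name, then
			 * retry.  The same name may map to different uids
			 * in different user namespaces. */
			int err = uid_map_upcall(ns, name);
			if (err)
				return err;
			e = uid_map_lookup(ns, name);
			if (!e)
				return -ENOENT;
		}
		*uid = e->uid;
		return 0;
	}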

> Admins can either use nfsv4 for all
> containers, or implement isolation through SELinux/Smack, or accept that
> uid 0 in a container has access to uid 0-owned files in all namespaces
> plus capabilities in all namespaces.

As for the uid 0 in a container problem: I would still like to make
the default filesystem permission check for user equality be essentially:

	vfs_mount.user_namespace == task.user_namespace &&
	inode.uid == task.uid

That saves us from most of the trouble, and in particular almost
immediately makes most of /proc and /sys container safe.
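
In code, that default check might sit in a helper along these lines (a
sketch only; the mnt_user_ns tag on the mount and the user_ns pointer on
the task assumed here do not exist in today's tree):

	/* Sketch: the default owner test for namespace-unaware
	 * filesystems.  Owner access requires both that the task live
	 * in the user namespace the mount was tagged with and that the
	 * uids match. */
	static bool inode_owner_matches(const struct vfsmount *mnt,
					const struct inode *inode,
					const struct task_struct *task)
	{
		return mnt->mnt_user_ns == task->user_ns &&
		       inode->i_uid == task->uid;
	}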

> Note that as soon as the fs is tagged with user namespaces, then we
> can simply have task->cap_effective apply only to tasks and files
> in its own user_ns, so CAP_DAC_OVERRIDE in a child userns doesn't
> grant you privilege to files owned by others in another userns.  But
> without that, CAP_KILL can be contained to tasks within your own userns,
> but CAP_DAC_OVERRIDE in a child userns can't be contained.

Right.  We still need to find a way to describe the case of allowing the
creator of the container to have access to everything the same way root
does today.

Eric
