[PATCH 10/10] sysfs: user namespaces: add ns to user_struct
Serge E. Hallyn
serue at us.ibm.com
Fri May 2 15:21:34 PDT 2008
Quoting Eric W. Biederman (ebiederm at xmission.com):
> "Serge E. Hallyn" <serue at us.ibm.com> writes:
>
> > Quoting Eric W. Biederman (ebiederm at xmission.com):
> >> "Serge E. Hallyn" <serue at us.ibm.com> writes:
> >>
> >> >> > Index: linux-mm/include/linux/sched.h
> >> >> > ===================================================================
> >> >> > --- linux-mm.orig/include/linux/sched.h
> >> >> > +++ linux-mm/include/linux/sched.h
> >> >> > @@ -598,7 +598,7 @@ struct user_struct {
> >> >> >
> >> >> > /* Hash table maintenance information */
> >> >> > struct hlist_node uidhash_node;
> >> >> > - uid_t uid;
> >> >> > + struct k_uid_t uid;
> >> >>
> >> >> If we are going to go this direction my inclination
> >> >> is to include an array of a single element in user_struct.
> >> >>
> >> >> Maybe that makes sense. I just know we need to talk about
> >> >> how a user maps into different user namespaces. As that
> >> >
> >> > My thought had been that a task belongs to several user_structs, but
> >> > each user_struct belongs to just one user namespace. Maybe as you
> >> > suggest that's not the right way to go.
> >> >
> >> > But are you ok with just sticking a user_namespace * in here for now,
> >> > and making it clear that the user_struct-user_namespace relation is yet
> >> > to be defined?
> >> >
> >> > If not that's fine, we just won't be able to clone(CLONE_NEWUSER)
> >> > until we get the relationship straightened out.
> >> >
> >> >> is a real concept that really occurs in real filesystems
> >> >> like nfsv4 and p9fs, and having infrastructure that can
> >> >> deal with the concept (even if it doesn't support it yet) would be
> >> >> useful.
> >> >
> >> > I'll have to look at 9p, because right now I don't know what you're talking
> >> > about. Then I'll move to the containers list to discuss what the
> >> > user_struct should look like.
> >>
> >> Ok. The concept present in nfsv4 and 9p is that a user is represented
> >> by a username string instead of by a numerical id. When nfsv4 encounters
> >> a username for which it doesn't have a cached mapping to a uid, it calls
> >> out to userspace to get that mapping. 9p does something similar, although
> >> I believe it is less general.
> >>
> >> The key point here is that we have clear precedent of a mapping from one user
> >> namespace to another in real world code. In this case nfsv4 has one user
> >> namespace (string based) and the systems that mount it have a separate
> >> user namespace (uid based).
> >>
> >> Once user namespaces are fleshed out I expect the same potential to
> >> exist: each user namespace can have a different uid mapping for
> >> the same username string on nfsv4.
> >>
> >> From uid we currently map to a user_struct, at which point things get a
> >> little odd. I think we could swing either way: either keeping kernel
> >> user namespaces completely disjoint or allowing them to be mapped to
> >> each other.
> >>
> >> I certainly like the classic NFS case of mapping uid 0 to user nobody
> >> on a nonlocal filesystem (outside of the container in our case) so they
> >> don't accidentally do something that root-only powers would otherwise
> >> allow.
> >>
> >> In general I think managing mapping tables between user namespaces is
> >> a pain in the butt and something to be avoided if you have the option.
> >> I do see a small place for it though.
> >>
> >> Eric
> >
> > No sense talking about how to relate uids+namespaces to user_structs to
> > task_structs without first laying out a few requirements. Here is the
> > list I would start with. I'm being optimistic here that we can one day
> > allow user namespaces to be unshared without privilege, and gearing the
> > requirements to that (in fact the requirements facilitate that):
> >
> > ===========================
> > Requirement:
> > ===========================
> > when uid 500 creates a new userns, then:
> > 1. uid 500 in parentns must be able to kill tasks in the container.
> > 2. uid 500 must be able to create, chown, change user_ns, delete
> > files belonging to the container.
> > 3. tasks in a container should be able to get 'user nobody'
> > access to files from outside the container (/usr ro-remount)
> > 4. uid 400 in the container created by uid 500 must not be able
> > to read files belonging to uid 400 in the parent userns
> > 5. uid 400 in the container created by uid 500 must not be able
> > to signal tasks owned by uid 400 in the parent user_ns (*1)
> > 6. a privileged app in the container created by uid 500 must not
> > get privilege over tasks outside the container (*1)
> > 7. a privileged app in the container created by uid 500 must not
> > get privilege over files outside the container (*2)
>
> Sounds like a reasonable set of requirements.
> >
> > *1: this should be mostly impossible if we have CLONE_NEWUSER|CLONE_NEWPID
> > *2: the feasibility of this depends entirely on what we do to tag fs.
> >
> > Based on that I'd say that the fancier mapping of uids between
> > containers really isn't necessary, and if needed it can always
> > be emulated using, e.g., nfsv4 to do the actual mapping of container
> > uids to usernames known by the network fs.
> >
> > But we also need to decide what we're willing to do for the regular
> > container filesystem. That's where I keep getting stuck.
>
> Then let's look at this a couple of different ways.
> - Filesystems have an internal user namespace that they use
> for their permissions checks, and they have some means
> of mapping that user namespace into the kernel's user namespace.
> Usually this is a one-to-one mapping. But occasionally
> in cases like nfsv4 or fat there is a more interesting
> mapping scheme going on.
>
> For filesystems on remote servers and removable media like
> usb keys, this concept of the filesystem's user namespace
> seems relevant.
>
> In practice this gives us two classes of filesystems we have
> to work with.
> - Multiple user namespace aware filesystems like nfsv4.
> - Normal filesystems that do a one-to-one mapping between
> kernel uids and filesystem uids.
>
> > Do we tag each inode with a user_namespace based on some mount context?
> Last time this was discussed, the sane thing appeared to be tagging the
> vfs_mount structure with the namespace of the mount, and having
> the default permission operations reference back to the mount point.
>
> For multiple-namespace-aware filesystems this seems especially useful,
> as it does not pollute the pure filesystem structures like the
> superblock, and allows nfs to do all of its superblock merging tricks.
>
> > Do we tag some files with a persistent 'key' which uniquely identifies a
> > user in all user namespaces (and across reboots)?
>
> I think we leave that up to the filesystem or a stackable filesystem
> that rides on top of the filesystem. With a stackable filesystem
> we should be able to easily implement the vserver trick of using
> high bits of the uid to indicate which uid namespace we care about.
>
> > Do we implement a
> > new, mostly pass-through stackable fs which we mount on top of an
> > existing fs to do uid translation?
>
> Probably, for supporting older filesystems. Unless we choose to make
> a more interesting translation layer like what nfsv4 uses and just
> upgrade classic unix filesystems to use that.
>
> > Do we force the use of nfsv4?
> > Do we
> > rely on an LSM like SELinux or smack to provide fs isolation between
> > user namespaces? Do we use a new LSM that just adds security.userns
> > xattrs to all files to tag the userns?
>
> > Heck, maybe nfsv4 is the way to go.
> Let's make that our first target general solution. Something that handles
> multiple user namespaces at the filesystem level now.
>
> > Admins can either use nfsv4 for all
> > containers, or implement isolation through SELinux/Smack, or accept that
> > uid 0 in a container has access to uid 0-owned files in all namespaces
> > plus capabilities in all namespaces.
>
> As for the uid 0 in a container problem: I still would like to make
> the default filesystem permission check for user equality be essentially:
>
>	vfs_mount.user_namespace == task.user_namespace &&
>	inode.uid == task.uid
>
> That saves us from most of the trouble. And in particular almost immediately
> makes most of /proc and /sysfs container safe.
>
> > Note that as soon as the fs is tagged with user namespaces, then we
> > can simply have task->cap_effective apply only to tasks and files
> > in its own user_ns, so CAP_DAC_OVERRIDE in a child userns doesn't
> > grant you privilege to files owned by others in another userns. But
> > without that, CAP_KILL can be contained to tasks within your own userns,
> > but CAP_DAC_OVERRIDE in a child userns can't be contained.
>
> Right. We still need to find a way to describe the case of allowing the
> creator of the container to have access to everything the same way root
> does today.