[PATCH v2 00/28] user_namespace: introduce fsid mappings

Stéphane Graber stgraber at ubuntu.com
Mon Feb 17 21:57:12 UTC 2020

On Mon, Feb 17, 2020 at 4:12 PM James Bottomley <
James.Bottomley at hansenpartnership.com> wrote:

> On Fri, 2020-02-14 at 19:35 +0100, Christian Brauner wrote:
> [...]
> > With this patch series we simply introduce the ability to create fsid
> > mappings that are different from the id mappings of a user namespace.
> > The whole feature set is placed under a config option that defaults
> > to false.
> >
> > In the usual case of running an unprivileged container we will have
> > setup an id mapping, e.g. 0 100000 100000. The on-disk mapping will
> > correspond to this id mapping, i.e. all files which we want to appear
> > as 0:0 inside the user namespace will be chowned to 100000:100000 on
> > the host. This works, because whenever the kernel needs to do a
> > filesystem access it will lookup the corresponding uid and gid in the
> > idmapping tables of the container.
> > Now think about the case where we want to have an id mapping of 0
> > 100000 100000 but an on-disk mapping of 0 300000 100000 which is
> > needed to e.g. share a single on-disk mapping with multiple
> > containers that all have different id mappings.
> > This will be problematic. Whenever a filesystem access is requested,
> > the kernel will now try to lookup a mapping for 300000 in the id
> > mapping tables of the user namespace but since there is none the
> > files will appear to be owned by the overflow id, i.e. usually
> > 65534:65534 or nobody:nogroup.
> >
> > With fsid mappings we can solve this by writing an id mapping of 0
> > 100000 100000 and an fsid mapping of 0 300000 100000. On filesystem
> > access the kernel will now lookup the mapping for 300000 in the fsid
> > mapping tables of the user namespace. And since such a mapping
> > exists, the corresponding files will have correct ownership.
> How do we parametrise this new fsid shift for the unprivileged use
> case?  For newuidmap/newgidmap, it's easy because each user gets a
> dedicated range and everything "just works (tm)".  However, for the
> fsid mapping, assuming some newfsuid/newfsgid tool to help, that tool
> has to know not only your allocated uid/gid chunk, but also the offset
> map of the image.  The former is easy, but the latter is going to vary
> by the actual image ... well unless we standardise some accepted shift
> for images and it simply becomes a known static offset.

For unprivileged runtimes, I would expect images to be unshifted and be
unpacked from within a userns. So your unprivileged user would be allowed
a uid/gid range through /etc/subuid and /etc/subgid and allowed to use
them through newuidmap/newgidmap. In that namespace, you can then pull
and unpack any images/layers you may want and the resulting fs tree will
look correct from within that namespace.

All that is possible today and is how for example unprivileged LXC works
right now.

What this patchset then allows is for containers to have differing
uid/gid maps while still being based off the same image or layers.
In this scenario, you would carve a subset of your main uid/gid map for
each container you run and run them in a child user namespace while
setting up a fsuid/fsgid map such that their filesystem accesses do not
follow their uid/gid map. This then results in proper isolation for
processes, networks, ... as everything runs as different kuid/kgid but
the VFS view will be the same in all containers.
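
To make the lookup concrete, here is a rough sketch of the idea in Python.
It is only a simplified model of an id map lookup, not the kernel code from
the patchset; the names (Extent, map_id_up) and the example ranges are
made up for illustration, and the map triples follow the usual
"inside-first outside-first count" layout.

```python
from typing import List, NamedTuple

OVERFLOW_ID = 65534  # usual overflow id (nobody/nogroup)


class Extent(NamedTuple):
    first: int        # first id as seen inside the namespace
    lower_first: int  # first id in the parent namespace
    count: int        # number of consecutive ids covered


def map_id_up(extents: List[Extent], kid: int) -> int:
    """Translate a parent-namespace id into the id seen inside the namespace.

    Returns OVERFLOW_ID when no extent covers the id.
    """
    for e in extents:
        if e.lower_first <= kid < e.lower_first + e.count:
            return e.first + (kid - e.lower_first)
    return OVERFLOW_ID


# Two containers with different id maps (hypothetical ranges) ...
c1_idmap = [Extent(0, 100000, 65536)]
c2_idmap = [Extent(0, 200000, 65536)]
# ... sharing one fsid map that matches the on-disk ownership.
shared_fsidmap = [Extent(0, 300000, 65536)]

on_disk_owner = 300000 + 1000  # file stored as uid 301000

# Looked up through the id map, the file is unmapped in both containers.
print(map_id_up(c1_idmap, on_disk_owner))        # 65534
# Looked up through the fsid map, both containers see uid 1000.
print(map_id_up(shared_fsidmap, on_disk_owner))  # 1000
```

Processes in the two containers still run under disjoint id ranges, while
filesystem ownership resolves identically for both.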

Shared storage between those otherwise isolated containers would also
work just fine by simply bind-mounting the same path into two or more
containers.

Now one additional thing that would be safe for a setuid wrapper to
allow would be for arbitrary mapping of any of the uid/gid that the user
owns to be used within the fsuid/fsgid map. One potential use for this
would be to create any number of user namespaces, each with their own
mapping for uid 0 while still having all VFS access be mapped to the
user that spawned them (say uid=1000, gid=1000).
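
As an illustration of the kind of check such a wrapper might do, here is a
small Python sketch. It is purely hypothetical: no such tool exists in this
series, the function name and the owned ranges are invented, and a real
setuid helper would need far more care than this.

```python
from typing import List, Tuple

# (inside_first, host_first, count), same layout as a uid_map line
MapEntry = Tuple[int, int, int]
# (host_first, count), e.g. the user's own uid plus a subuid allocation
OwnedRange = Tuple[int, int]


def fsid_map_allowed(requested: List[MapEntry], owned: List[OwnedRange]) -> bool:
    """Return True if every host id in `requested` lies inside one owned range.

    Simplification: an entry spanning two adjacent owned ranges is rejected.
    """
    for _inside_first, host_first, count in requested:
        if count <= 0:
            return False
        covered = any(
            start <= host_first and host_first + count <= start + n
            for start, n in owned
        )
        if not covered:
            return False
    return True


# Hypothetical user: uid 1000 plus a delegated range of 65536 ids.
owned = [(1000, 1), (100000, 65536)]

# Map fsuid 0 in the namespace to the spawning user (uid 1000): allowed.
print(fsid_map_allowed([(0, 1000, 1)], owned))  # True
# Map fsuid 0 in the namespace to host root: refused.
print(fsid_map_allowed([(0, 0, 1)], owned))     # False
```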

Note that in our case, the intended use for this is from a privileged
where our images would be unshifted as would be the container storage
and any shared storage for containers. The security model effectively relies
on properly configured filesystem permissions and mount namespaces such
that the content of those paths can never be seen by anyone but root outside
of those containers (and therefore avoids all the issues around

We will then be able to allocate distinct, random ranges of 65536
uids/gids (or more) for each container without ever having to do any
uid/gid shifting at the filesystem layer or run into issues when having
to set up shared storage between containers or attaching external
storage volumes to those containers.
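
A minimal sketch of what such an allocation could look like, in Python. This
is not how any existing tool does it; the function name, the bounds (lo, hi)
and the slot-aligned strategy are assumptions chosen to keep the example
short. Aligning bases to multiples of the range size means two allocations
can never overlap.

```python
import random
from typing import Set

RANGE_SIZE = 65536  # ids handed to each container


def allocate_range(used: Set[int], rng: random.Random,
                   lo: int = 1_000_000, hi: int = 1_000_000_000) -> int:
    """Pick a random, unused base for a RANGE_SIZE-wide id range in [lo, hi).

    `used` holds the bases already handed out and is updated in place.
    """
    slots = (hi - lo) // RANGE_SIZE
    if len(used) >= slots:
        raise RuntimeError("no free id range left")
    while True:
        base = lo + rng.randrange(slots) * RANGE_SIZE
        if base not in used:
            used.add(base)
            return base


used: Set[int] = set()
rng = random.Random()
for name in ("c1", "c2", "c3"):
    base = allocate_range(used, rng)
    # With a shared fsid map, only the id map line differs per container.
    print(f"{name}: id map '0 {base} {RANGE_SIZE}'")
```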



