LPC 2020 Hackroom Session: summary and next steps for isolated user namespaces

Sun Aug 30 14:39:59 UTC 2020

Hello everyone,

## Preliminaries

This is the summary of the Hackroom session Stéphane and I led as a follow-up
to our presentations in the Containers & Checkpoint/Restore micro-conference at
Linux Plumbers 2020.

Please make sure to see the Action Items section below as it outlines the next
concrete steps that came up during the meeting and who seemed interested in
tackling them.

The background for this summary is:

1. Stéphane's and my talk "Isolated Dynamic User Namespaces"
   People interested in the full session can watch it on YouTube:
   https://youtu.be/fSyr_IXM21Y?t=8856

2. The Hackroom session on Wednesday, 25.08.2020 at 17:00 UTC
   This session has been recorded as well. It is not yet on YouTube because
   Hackroom sessions weren't streamed. However, I plan on cutting that video
   and putting it up on YouTube as well just so there's no chance of
   miscommunication.

Everyone who attended session 1 was asked to send me an e-mail if they wanted
to attend session 2 to hash out details. The following people asked to attend
session 2 and were informed either through the e-mail I sent out or via IRC:

Aleksa Sarai
Alexander Mihalicyn
Andy Lutomirski
Christian Brauner
Eric W. Biederman
Geoffrey Thomas
Giuseppe Scrivano
Joseph Christopher Sible
Josh Triplett
Kees Cook
Mickaël Salaün
Mrunal Patel
Pavel Tikhomirov
Sargun Dhillon
Serge Hallyn
Stéphane Graber
Vivek Goyal
Wat Lim

All of them should be Cc'ed here. In case I forgot someone, don't hesitate to
forward this mail to them!

## Summary

During the Containers & Checkpoint/Restore micro-conference and in the hackroom
session Stéphane Graber and I proposed a way to make using user namespaces
simpler and more isolated. The following current problems were identified:

P1. Isolated id mappings can only be guaranteed to be locally isolated.
    A container runtime/daemon can only guarantee non-overlapping id mappings
    when no other users on the system create containers.

P2. Enforcing isolated id mappings in userspace is difficult.
    It is always possible to create other processes with overlapping id
    mappings. Coordinating id mappings in userspace will always remain
    optional. Quite a few tools nowadays (including systemd) don't care about
    /etc/sub{g,u}id and actively advise against using it. This is made even
    more problematic since sub{g,u}id delegation is done per-user rather than
    per-container-runtime.

P3. The range of the id mapping of a container can't be predetermined.
    While POSIX mandates that a standard system should use a range of 65536 ids,
    reality is very different. Some programs allocate high ids for random
    processes or for network authentication. This means that in practice it is
    often necessary to assign a range of up to 10 million ids to a container.
    With a 32 bit id space (2^32 / 10^7 is roughly 429) this limits a system to
    fewer than 500 containers in total.

P4. Isolated id mappings severely restrict the number of containers that can be
    run on a system.
    This ties back to the point about pre-determining the id range of a
    container and how large range allocations tend to be on real systems. That
    becomes even more relevant when nesting containers.

P5. Container runtimes cannot reuse overlayfs lower directories if each
    container uses isolated ID mappings, leading to either needless storage
    overhead (LXD -- though the LXD folks don’t really mind), completely
    ignoring the benefits of isolating containers from each other (Docker), or
    not using them at all (Kubernetes). (This is a more general issue but bears
    repeating since it is closely tied to most userns proposals.)

P6. Rlimits pose a problem for containers that share the same id mapping.
    This means containers with overlapping id mappings can DoS each other by
    exhausting their rlimits. The reason for this lies with the current
    implementation of rlimits -- rlimits are currently tied to users and are
    not hierarchically limited like inotify limits are. This is a severe
    problem in unprivileged workloads. Eric and others identified that this
    issue can be fixed independently of the isolated user namespace proposal.
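
The arithmetic behind P3 and P4 can be checked quickly. The following is only a
rough sketch, assuming 32 bit ids and a disjoint, fixed-size range per
container. The helper name is made up for illustration, and ids reserved for
the host itself are ignored, so real numbers would be somewhat lower:

```python
# Back-of-the-envelope check of P3/P4: how many disjoint id ranges fit
# into the 32 bit id space? (max_isolated_containers is a hypothetical
# helper for illustration, not an existing API.)

ID_SPACE = 2**32  # uid_t/gid_t are 32 bits wide on Linux


def max_isolated_containers(ids_per_container: int) -> int:
    """Return the number of non-overlapping ranges of the given size."""
    if ids_per_container <= 0:
        raise ValueError("range size must be positive")
    return ID_SPACE // ids_per_container


if __name__ == "__main__":
    for size in (65536, 10_000_000):
        count = max_isolated_containers(size)
        print(f"{size:>10} ids per container -> {count} containers")
```

Nesting makes this worse, since every nested container has to carve its range
out of the range already delegated to its parent.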

In response to these and other issues, we made the following proposal, which
had already been floated in less clear form in informal discussions at Linux
Plumbers 2019 in Lisbon:

## Proposal

Introduce an in-kernel concept of an isolated user namespace by switching the
id types in the kernel from 32 to 64 bits. Userspace will only get to see the
lower 32 bits as usual. The upper 32 bits are used for a unique, in-kernel user
namespace token. The owner of such a namespace will either be the effective id
of the creator of that namespace or optionally an owning id can be set (when
created by a privileged user).
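
To make the split concrete, here is a minimal userspace sketch of the id layout
described above. It is not kernel code. All names are hypothetical, and using
token 0 to mean "not isolated" is purely an assumption for illustration:

```python
# Sketch of the proposed 64 bit id layout: upper 32 bits hold the in-kernel
# namespace token, lower 32 bits hold the id that userspace gets to see.
# Names and the meaning of token 0 are assumptions, not part of the proposal.

ID_BITS = 32
ID_MASK = (1 << ID_BITS) - 1


def pack_kid(ns_token: int, visible_id: int) -> int:
    """Combine a namespace token and a userspace-visible id into 64 bits."""
    if not 0 <= ns_token <= ID_MASK:
        raise ValueError("namespace token must fit in 32 bits")
    if not 0 <= visible_id <= ID_MASK:
        raise ValueError("visible id must fit in 32 bits")
    return (ns_token << ID_BITS) | visible_id


def visible_id(kid: int) -> int:
    """Lower 32 bits: what userspace inside the namespace sees."""
    return kid & ID_MASK


def ns_token(kid: int) -> int:
    """Upper 32 bits: the kernel-internal namespace token."""
    return kid >> ID_BITS
```

With such a layout two isolated namespaces can both use id 1000 internally
while the kernel still sees two distinct 64 bit values, for example
`pack_kid(1, 1000) != pack_kid(2, 1000)`.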

The following advantages were identified by various people during the session:

S1. An isolated user namespace has access to the full 32 bit id range.
    This makes it compatible with every Linux workload and allows supporting
    post-POSIX range users that allocate high-range ids (LDAP, systemd, etc).
    This solves P3 and P4.

S2. Kernel-enforced user namespace isolation.
    This means, there is no need for different container runtimes to
    collaborate on id ranges with immediate benefits for everyone.
    This solves P1 and P2.

S3. The need to split existing id ranges is completely removed.
    Nested containers become trivial.

S4. Simplify the usage of user namespaces significantly for newcomers.
    This should hopefully finally increase adoption in userspace especially in
    the application container and Kubernetes space.

S5. The owning id concept of a user namespace makes monitoring and interacting
    with such containers way easier.

S6. When interacting with an isolated user namespace from an ancestor user
    namespace, the owning id can be used as a credential.

The need and desire for these features seemed to be expressed by most
participating parties.

### Issues

Two main issues were discussed:

1. How are interactions across isolated user namespaces handled?
   An isolated user namespace can interact with another isolated user namespace
   or an ancestor user namespace. A good example is socket credentials. They
   can be seen and received outside of the container. In those cases the id of
   the isolated user namespace needs to be represented.
   The proposals to solve this problem were:
   1.1. Use the owning id of the isolated user namespace.
        A parent user namespace would see the configured owning id of the
        isolated user namespace (mapped to that user namespace).
        A non-ancestor user namespace would see the overflow ids.
   1.2. Always use the overflow id for isolated user namespaces.
        Any other user namespace would see the overflow id configured on the
        system.
   Proposal 1.1 seemed preferred since it would allow an unprivileged
   user creating an isolated user namespace to kill/ptrace all processes
   in the isolated namespace they spawned. In contrast, proposal 1.2
   would not allow for visible ownership of the container, which would
   not only make tracking down the container harder but also prevent the
   owner from accessing those processes through other APIs.

   Related to this proposal it was suggested to introduce a new sysctl
   which would allow the system administrator to prevent any id
   transitions to overflow ids, i.e. a process would not be able to
   set{g,u}id() to the overflow {g,u}id.  A distribution can then decide
   to select specific overflow ids (akin to a system id) and set them
   via the already existing /proc/sys/kernel/overflow{g,u}id sysctl
   interfaces. This increases the security that isolated user namespaces
   provide even more.

2. How is filesystem access in isolated user namespaces handled?
   (This is basically the problem outlined in P5).
   There were quite a few proposals pitched by Andy and some others and it
   would be difficult to summarize them all here, especially since a few of
   them were rather rudimentary sketches. Once the YouTube video of the
   Hackroom session is up people can listen to it in more detail.

   The first consensus reached seemed to be to decouple isolated user
   namespaces from shiftfs. The idea is to solely rely on tmpfs and fuse
   at the beginning as filesystems which can be mounted inside isolated
   user namespaces and so would have proper ownership. For mount points
   that originate from outside the namespace, everything will show as
   the overflow ids and access would be restricted to the most
   restrictive permission bits for any path that can be accessed.
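
The difference between proposals 1.1 and 1.2 under issue 1 can be sketched as
a small model. This is only an illustration of the rules as summarized above:
the class and function names are hypothetical, translation of the owning id
between nested namespaces is left out, and 65534 is simply the kernel default
for the overflow ids:

```python
from dataclasses import dataclass
from typing import Optional

# Kernel default for /proc/sys/kernel/overflow{g,u}id.
OVERFLOW_ID = 65534


@dataclass(eq=False)
class UserNS:
    """Toy model of an isolated user namespace (hypothetical)."""
    name: str
    parent: Optional["UserNS"] = None
    owner_id: int = 0  # owning id; mapping between namespaces is omitted


def is_ancestor(candidate: UserNS, ns: UserNS) -> bool:
    """Return True if candidate is a strict ancestor of ns."""
    cur = ns.parent
    while cur is not None:
        if cur is candidate:
            return True
        cur = cur.parent
    return False


def id_seen_by(observer: UserNS, isolated: UserNS, proposal: str = "1.1") -> int:
    """Id under which observer sees credentials from the isolated namespace."""
    if proposal == "1.1" and is_ancestor(observer, isolated):
        return isolated.owner_id  # ancestors see the owning id
    return OVERFLOW_ID  # everyone else (or proposal 1.2) sees the overflow id
```

For a host namespace with two sibling containers, the host would see the
owning id of each container under proposal 1.1, while the siblings would see
each other as the overflow id. Under proposal 1.2 the host would see the
overflow id as well.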

### Additional Requirements

Sargun pointed out that they make use of NFSv4 both id mapped, and non-id
mapped. Different id mappings between different filesystems in NFS is not part
of their use-case currently and so it is fine if the ids are passed through as
is. He additionally pointed out that they would like to be able to use the
idmapper tool in such isolated containers. This tool maps a given process id to
the highest user id available. It seems that all of these use-cases would work
with the current setup.

It was proposed that for NFS an alternative solution should be considered,
namely making it possible to mount NFS inside of a user namespace. This work
would need to be done by someone well-versed in modern NFS.

### Action Items

The following consensus seemed to have been reached by the end of the session:

1. Fixing rlimits in user namespaces such that one namespace cannot affect
   another.
   This was identified as problem P6 above. During the session it seemed that
   Eric intended to investigate this.

2. Prototyping switching the kernel uid/gid types to 64bit, allowing for a
   hidden 32bit identifier and fully usable 32bit uid/gid range for the
   container.
   The consensus seemed to have been to implement a first version of this and
   do performance testing to see what the performance impact of this change
   would be.
   Aleksa Sarai and Christian Brauner stated they were interested in
   taking on this work jointly.

3. Find a way to allow setgroups() in a user namespace while keeping
   in mind the case of groups used for negative access control.
   This was suggested by Josh Triplett and Geoffrey Thomas. Their idea was to
   investigate adding a prctl() to allow setgroups() to be called in a user
   namespace at the cost of restricting paths to the most restrictive
   permission. So if something is 0707 it needs to be treated as if it's 0000
   even though the caller is not in its owning group which is used for negative
   access control (how these new semantics will interact with ACLs will also
   need to be looked into).

4. Add optional enforcement that the overflow uid/gid as set through
   sysctl cannot be used as a regular uid/gid on the system, which will allow
   userspace to disambiguate credential ids that are unmapped from the
   “nobody” user (which is actually used by distributions). It seemed that
   this idea was pitched by Geoffrey Thomas.
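
One possible reading of the "most restrictive permission" rule from action
item 3 is that only permission bits granted to owner, group, and other alike
remain effective. The sketch below assumes exactly that interpretation; the
function name is hypothetical and ACLs are ignored entirely:

```python
def most_restrictive(mode: int) -> int:
    """Reduce a mode to the bits shared by owner, group, and other.

    Assumed interpretation of "most restrictive permission", not a
    settled design.
    """
    owner = (mode >> 6) & 0o7
    group = (mode >> 3) & 0o7
    other = mode & 0o7
    common = owner & group & other
    # Apply the shared bits to all three classes.
    return (common << 6) | (common << 3) | common


if __name__ == "__main__":
    for mode in (0o707, 0o755, 0o644):
        print(f"{mode:04o} -> {most_restrictive(mode):04o}")
```

Under this reading a 0707 path is treated as 0000 because the group class
grants nothing, which matches the example given in action item 3.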

Special thanks to Stéphane and Aleksa for corrections and additions!

Thanks!
Christian