Thoughts on tightening up user namespace creation

Alexander Larsson alexl at
Tue Mar 8 10:05:30 UTC 2016

On Mon, 2016-03-07 at 21:15 -0800, Andy Lutomirski wrote:
> Hi all-
> I think there are three main types of concerns.  First, there might be
> some as-yet-unknown semantic issues that would allow privilege
> escalation by users who create user namespaces and then confuse
> something else in the system.  Second, enabling user namespaces
> exposes a lot of attack surface to unprivileged users.  Third,
> allowing tasks to create user namespaces exposes the kernel to various
> resource exhaustion attacks that wouldn't be possible otherwise.

In my work on xdg-app I've seen some issues that I'd ideally like to
see a solution to. They are not necessarily security vulnerabilities,
but they are still problems:

devpts is only mountable in a user namespace if the root user is
mapped. Possible to work around, but ugly.

There is no way to recursively apply mount flags. For example, I often
want to recursively bind mount some directory from the host but with
MS_RDONLY|MS_NODEV.  I cannot apply the flags in the MS_BIND|MS_REC
mount, so instead I have to first bind mount and then remount. However,
the remount is not recursive, so I have to manually parse
/proc/self/mountinfo and figure out all the submounts that were added.
Also, I have to manually avoid trying to remount covered mounts,
because I can't reach those, and for each remount I have to parse out
its current flags so I don't accidentally unset some set flag, causing
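
The mountinfo-parsing step described above could look roughly like the
sketch below. It is only an illustration, not the code xdg-app uses: the
names Mount, parse_mountinfo and submounts are made up for this example,
and it only covers finding the submounts and their current per-mount
options. It does not detect covered mounts and does not perform any
remount.

```python
import re
from typing import List, NamedTuple


class Mount(NamedTuple):
    """One parsed line of /proc/self/mountinfo (hypothetical helper type)."""
    mount_id: int
    parent_id: int
    mount_point: str
    options: List[str]  # per-mount options, e.g. ["rw", "nosuid"]


def _unescape(field: str) -> str:
    # mountinfo writes space, tab, newline and backslash as \ooo octal escapes.
    return re.sub(r"\\([0-7]{3})", lambda m: chr(int(m.group(1), 8)), field)


def parse_mountinfo(text: str) -> List[Mount]:
    """Parse the contents of a mountinfo file into Mount records."""
    mounts = []
    for line in text.splitlines():
        fields = line.split()
        # Fields: id, parent id, major:minor, root, mount point, options,
        # zero or more optional fields, "-", fs type, source, super options.
        if len(fields) < 7 or "-" not in fields:
            continue
        mounts.append(Mount(int(fields[0]), int(fields[1]),
                            _unescape(fields[4]), fields[5].split(",")))
    return mounts


def submounts(mounts: List[Mount], root: str) -> List[Mount]:
    """Return the mount at root and every mount below it, in file order."""
    root = root.rstrip("/") or "/"
    prefix = "/" if root == "/" else root + "/"
    return [m for m in mounts
            if m.mount_point == root or m.mount_point.startswith(prefix)]
```

A caller would read /proc/self/mountinfo, pass its contents to
parse_mountinfo(), and then for each entry from submounts() issue a bind
remount that keeps whatever restrictive options (nosuid, nodev, noexec,
...) are already listed in that entry's options.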

Mount flags are not applied on propagated mounts. Even if I do all the
stuff above, if I get a *new* mount propagated into my namespace, or if
a parent unmount is propagated, uncovering a mount in my namespace,
then this new mountpoint is not read-only. This has no workaround that
I'm currently aware of.

Abstract unix domain sockets are tied to the network namespace. I
understand where this comes from: socket syscalls are "networkish".
However, the non-abstract unix domain sockets are under the control of
the filesystem namespace, and I can fully control them when setting up
the sandbox. But as long as the sandbox shares the network namespace
with the host (which is likely for desktop apps), it will have full
access to all services listening on abstract sockets on the host. This
is particularly problematic because 1) abstract sockets have no file
permissions, so any X server running on the host is wide open, and 2)
whether a connect call uses abstract sockets is not detectable via
seccomp, so we can't filter it in any other way. I don't know how severe
this is, as it depends on how trustworthy the individual services are,
but at least on my system "grep @ /proc/net/unix" lists session dbus
instances, the X server, and some iSCSI thing.
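
For anyone who wants the same listing programmatically, here is a small
sketch of what the grep does. The function name abstract_sockets is made
up for this example; it relies only on the fact that /proc/net/unix puts
the socket name in the last column and shows abstract names with a
leading "@".

```python
from typing import List


def abstract_sockets(proc_net_unix: str) -> List[str]:
    """Return the sorted, de-duplicated abstract socket names found in
    the given /proc/net/unix contents."""
    names = set()
    for line in proc_net_unix.splitlines():
        # Columns: Num RefCount Protocol Flags Type St Inode [Path]
        fields = line.split(None, 7)
        # Unnamed sockets have no Path column; the header's "Path" and
        # filesystem sockets do not start with "@", so both are skipped.
        if len(fields) == 8 and fields[7].startswith("@"):
            names.add(fields[7])
    return sorted(names)
```

Calling it as abstract_sockets(open("/proc/net/unix").read()) inside the
sandbox shows which host services are reachable this way whenever the
network namespace is shared.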

/proc (even the limited pid namespace one) contains a lot of old cruft
that at a minimum leaks hardware info to the sandbox, and could
potentially do worse (/proc/sysrq-trigger anyone?). I'd like to be able
to mount a "clean" /proc that has only the process-related stuff.

> +++ What does the privilege of creating a user namespace entail? +++
> It might be more interesting to allow a task to unshare all
> namespaces, hold all capabilities in them, but to still be unable to
> use certain privileged facilities.  For example, maybe denying
> administrative control over iptables, creation of exotic network
> interface types, or similar would make sense.  

> I don't know how we'd specify this type of constraint.

I think this particular issue is the main problem here. Unless we add
some very coarse bit-flags that specify the constraints, it is going to
be a very complex API to set up such constraints. Adding coarse
bit-flags essentially means adding new capabilities (maybe subsetting
existing ones). Given how hard it is to understand how all the current
capabilities interact and how they can be exploited, I'm not sure this
is a great idea.

Maybe we can use the LSM framework to model the constraints? For
instance, the user could be allowed to create user namespaces, but the
processes in them automatically get some SELinux context applied. Then
that SELinux context could be configured to limit access to certain

> +++ Who can create user namespaces (possibly with restrictions)? +++
> I can think of a few formulations.
> A simpler approach would be to add a per-namespace setting listing
> users and/or groups that can unshare their userns.  A userns starts
> out allowing everyone to unshare userns, and anyone with
> can change the setting.

This sounds like a cgroup controller to me. It makes sense for my
use case (i.e. sandboxed desktop apps). You want to give all processes
in the user's login session access to user namespaces, but not
necessarily to e.g. a service, background process or cron job running as that

> A fancier approach would be to have an fd that represents the right to
> unshare your userns.  Some privilege broker could give out those fds
> to apps that need them and meet whatever criteria are set.  If you try
> to unshare your userns without the fd, it falls back to some simpler
> policy.

In practice though, how would the privilege broker know and apply the
criteria? It doesn't even have the information the kernel has (such as
race-free access to the peer cgroup).
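
To make the limitation concrete, this is roughly what a userspace broker
listening on a unix socket can learn about its peer. It is a sketch
under assumptions: the function names are made up for this example, and
it presumes a Linux system where SO_PEERCRED returns a struct ucred of
three native ints.

```python
import socket
import struct
from typing import Tuple


def peer_credentials(conn: socket.socket) -> Tuple[int, int, int]:
    """Return (pid, uid, gid) of the peer of a connected AF_UNIX socket."""
    # struct ucred is { pid_t pid; uid_t uid; gid_t gid; }.
    size = struct.calcsize("3i")
    data = conn.getsockopt(socket.SOL_SOCKET, socket.SO_PEERCRED, size)
    pid, uid, gid = struct.unpack("3i", data)
    return pid, uid, gid


def peer_cgroup_racy(pid: int) -> str:
    """Look up the cgroup membership of pid via /proc.

    This is the racy part: between SO_PEERCRED and this read the peer
    may have exited (and the pid been reused) or moved to another cgroup.
    """
    with open("/proc/%d/cgroup" % pid) as f:
        return f.read()
```

The credentials themselves are recorded by the kernel, but anything the
broker derives from the pid afterwards, such as the cgroup, goes through
a second lookup that is not tied to the connection.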

 Alexander Larsson                                            Red Hat, Inc 
       alexl at            alexander.larsson at 
He's an ungodly devious paramedic on his last day in the job. She's a 
sharp-shooting cigar-chomping archaeologist married to the Mob. They 
fight crime! 
