Could not mount sysfs when enable userns but disable netns

Fri Jul 11 16:29:05 UTC 2014

"Serge E. Hallyn" <serge at hallyn.com> writes:

> Quoting chenhanxiao at cn.fujitsu.com (chenhanxiao at cn.fujitsu.com):
>> Hello,
>> 
>> How to reproduce:
>> 1. Prepare a container, enable userns and disable netns
>> 2. use libvirt-lxc to start a container
>> 3. libvirt could not mount sysfs then failed to start.
>> 
>> Then I found that
>> commit 7dc5dbc879bd0779924b5132a48b731a0bc04a1e says:
>> "Don't allow mounting sysfs unless the caller has CAP_SYS_ADMIN rights
>> over the net namespace."
>> 
>> But why should we check sysfs mount permission over the net namespace?
>> We've already checked CAP_SYS_ADMIN, though.

We already checked capable(CAP_SYS_ADMIN) and it failed.

>> What is the relationship between sysfs and the net namespace?
>> Or is this check a little redundant?

You want a bind mount, not a fresh mount.

When looking at how evil actors could abuse things it turned out that in
some circumstances the root user (before a user namespace is created)
needs to control the policy on which filesystems may be mounted.  There
are files in sysfs and in proc that you never want to see in a chroot
jail, as they just create more surface area to attack.

The only reason for creating a new fresh mount of sysfs is to get access
to /sys/class/net.  So to keep things simple we restrict creation of
that mount to cases where the mounter has permissions over the network
namespace, and cases where nothing interesting is mounted on top of
sysfs.

If a new /sys/class/net is not needed it is possible to bind mount the
existing copy of sysfs to the new location without loss of
functionality.
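
A minimal sketch of the bind-mount alternative, assuming a Linux host and
a caller that is allowed to mount in its own mount namespace.  The helper
name bind_mount_sysfs and its parameters are hypothetical; the only real
interface used is mount(2), reached through Python's ctypes.

```python
import ctypes
import os

# Flag values from <sys/mount.h> on Linux.
MS_BIND = 4096
MS_REC = 16384


def bind_mount_sysfs(target, source="/sys", recursive=True):
    """Bind mount the existing sysfs tree at `target` using mount(2).

    This reuses the sysfs instance that is already mounted at `source`
    instead of asking the kernel for a fresh sysfs mount, which would be
    mount("sysfs", target, "sysfs", 0, NULL).
    """
    flags = MS_BIND | (MS_REC if recursive else 0)

    # Load libc lazily so that importing this sketch has no side effects.
    libc = ctypes.CDLL(None, use_errno=True)
    libc.mount.argtypes = [
        ctypes.c_char_p,   # source
        ctypes.c_char_p,   # target
        ctypes.c_char_p,   # filesystemtype (ignored for bind mounts)
        ctypes.c_ulong,    # mountflags
        ctypes.c_void_p,   # data
    ]
    libc.mount.restype = ctypes.c_int

    rc = libc.mount(os.fsencode(source), os.fsencode(target), None, flags, None)
    if rc != 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err), os.fspath(target))
```

A container runtime would call something like
bind_mount_sysfs("/path/to/rootfs/sys") before pivoting into the new
root.  The recursive flag defaults to on in this sketch because sysfs
usually carries submounts; whether that is required depends on the setup.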

> It is not redundant.  The whole point is that after clone(CLONE_NEWUSER)
> you get a newly filled set of capabilities.  But you should not have
> privileges over the host's network namespace.  After you unshare a new
> network namespace, you *should* have privilege over it.  So the fact
> that we've already checked CAP_SYS_ADMIN means nothing, because the
> capabilities need to be targeted.

Exactly.  The tests are failing because the caller is not the global root,
and so the code is properly failing the permission checks.
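
To make "targeted" concrete, here is a toy model of the check, loosely
patterned on the kernel's ns_capable().  It is a sketch, not the kernel
code: the class names, ns_capable_model and may_mount_fresh_sysfs are all
hypothetical, and it leaves out the real rule that the creator of a user
namespace holds all capabilities in it.

```python
from dataclasses import dataclass
from typing import Optional

CAP_SYS_ADMIN = "CAP_SYS_ADMIN"


@dataclass(eq=False)
class UserNS:
    """A user namespace; parent is None for the initial (host) namespace."""
    name: str
    parent: Optional["UserNS"] = None


@dataclass(eq=False)
class NetNS:
    """A network namespace, owned by the user namespace that created it."""
    name: str
    owner: UserNS


@dataclass(eq=False)
class Task:
    """A process: the user namespace it lives in and the caps it holds there."""
    user_ns: UserNS
    caps: frozenset


def ns_capable_model(task, target, cap):
    """Return True if `task` holds `cap` relative to user namespace `target`.

    Capabilities held in a user namespace cover that namespace and its
    descendants, never its ancestors.  Walk up from `target`; if the
    task's own namespace is on that path, its capability set decides.
    """
    ns = target
    while ns is not None:
        if ns is task.user_ns:
            return cap in task.caps
        ns = ns.parent
    return False


def may_mount_fresh_sysfs(task, net_ns):
    """A fresh sysfs mount needs CAP_SYS_ADMIN over the net namespace's owner."""
    return ns_capable_model(task, net_ns.owner, CAP_SYS_ADMIN)


# userns enabled, netns disabled: the container still uses the host's netns.
host = UserNS("init")
container = UserNS("container", parent=host)
host_net = NetNS("host-net", owner=host)
root_in_container = Task(user_ns=container, caps=frozenset({CAP_SYS_ADMIN}))

denied = may_mount_fresh_sysfs(root_in_container, host_net)    # False

# After also unsharing the network namespace inside the container.
container_net = NetNS("container-net", owner=container)
allowed = may_mount_fresh_sysfs(root_in_container, container_net)  # True
```

The first case is the reported failure: the container's root has a full
capability set, but only relative to its own user namespace, so the check
against the host's network namespace fails.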

Eric