For review: user_namespace(7) man page

Eric W. Biederman ebiederm at
Tue Sep 2 01:05:47 UTC 2014

"Michael Kerrisk (man-pages)" <mtk.manpages at> writes:

> On 08/30/2014 11:53 PM, Eric W. Biederman wrote:
>> "Michael Kerrisk (man-pages)" <mtk.manpages at> writes:

>>> For various reasons, my work on the namespaces man pages 
>>> fell off the table a while back. Nevertheless, the pages have
>>> been close to completion for a while now, and I recently restarted,
>>> in an effort to finish them. As you also noted to me f2f, there have
>>> recently been some small namespace changes that may affect
>>> the content of the pages. Therefore, I'll take the opportunity to
>>> send the namespace-related pages out for further (final?) review.
>>> So, here, I start with the user_namespaces(7) page, which is shown 
>>> in rendered form below, with source attached to this mail. I'll
>>> send various other pages in follow-on mails.
>>> Review comments/suggestions for improvements / bug fixes welcome.
>>> Cheers,
>>> Michael
>>> ==
>>> NAME
>>>        user_namespaces - overview of Linux user namespaces
>>>        For an overview of namespaces, see namespaces(7).
>>>        User   namespaces   isolate   security-related   identifiers  and
>>>        attributes, in particular, user IDs and group IDs (see
>>>        credentials(7)), the root directory, keys (see keyctl(2)), and
>>>        capabilities (see capabilities(7)).  A process's user and group IDs can
>>>        be different inside and outside a user namespace.  In particular,
>>>        a process can have a normal unprivileged user ID outside  a  user
>>>        namespace while at the same time having a user ID of 0 inside the
>>>        namespace; in other words, the process has  full  privileges  for
>>>        operations  inside  the  user  namespace, but is unprivileged for
>>>        operations outside the namespace.
>>>    Nested namespaces, namespace membership
>>>        User namespaces can be nested;  that  is,  each  user  namespace—
>>>        except  the  initial  ("root") namespace—has a parent user names‐
>>>        pace, and can have zero or more child user namespaces.  The  par‐
>>>        ent user namespace is the user namespace of the process that cre‐
>>>        ates the user namespace via a call to unshare(2) or clone(2) with
>>>        the CLONE_NEWUSER flag.
>>>        The kernel imposes (since version 3.11) a limit of 32 nested lev‐
>>>        els of user namespaces.  Calls to  unshare(2)  or  clone(2)  that
>>>        would cause this limit to be exceeded fail with the error EUSERS.
>>>        Each  process  is  a  member  of  exactly  one user namespace.  A
>>>        process created via fork(2) or clone(2) without the CLONE_NEWUSER
>>>        flag  is  a  member  of the same user namespace as its parent.
>>>        A
>>            ^ single-threaded
>> Because of chroot and other things multi-threaded processes are not
>> allowed to join a user namespace.  For the documentation just saying
>> single-threaded sounds like enough here.
> Thanks. Fixed.
>>>        process can join another user namespace with setns(2) if  it  has
>>>        the  CAP_SYS_ADMIN  capability  in  that namespace; upon doing so, it gains a
>>>        full set of capabilities in that namespace.
>>>        A call to clone(2) or  unshare(2)  with  the  CLONE_NEWUSER  flag
>>>        makes  the  new  child  process (for clone(2)) or the caller (for
>>>        unshare(2)) a member of the new user  namespace  created  by  the
>>>        call.
>>>    Capabilities
>>>        The child process created by clone(2) with the CLONE_NEWUSER flag
>>>        starts out with a complete set of capabilities in  the  new  user
>>>        namespace.  Likewise, a process that creates a new user namespace
>>>        using unshare(2)  or  joins  an  existing  user  namespace  using
>>>        setns(2)  gains a full set of capabilities in that namespace.  On
>>>        the other hand, that process has no capabilities  in  the  parent
>>>        (in  the case of clone(2)) or previous (in the case of unshare(2)
>>>        and setns(2)) user namespace, even if the new namespace  is  cre‐
>>>        ated  or  joined by the root user (i.e., a process with user ID 0
>>>        in the root namespace).
>>>        Note that a call to execve(2) will cause a process  to  lose  any
>>>        capabilities that it has, unless it has a user ID of 0 within the
>>>        namespace.  See the discussion of user  and  group  ID  mappings,
>>>        below.
>>>        A   call   to   clone(2),   unshare(2),  or  setns(2)  using  the
>>>        CLONE_NEWUSER flag sets the  "securebits"  flags  (see  capabili‐
>>>        ties(7))  to  their  default  values  (all flags disabled) in the
>>>        child (for clone(2)) or caller  (for  unshare(2),  or  setns(2)).
>>>        Note  that  because  the caller no longer has capabilities in its
>>>        original user namespace after a call to setns(2), it is not  pos‐
>>>        sible for a process to reset its "securebits" flags while retain‐
>>>        ing its user namespace membership by using  a  pair  of  setns(2)
>>>        calls  to  move  to another user namespace and then return to its
>>>        original user namespace.
>>>        Having a capability inside a user namespace permits a process  to
>>>        perform  operations  (that  require  privilege) only on resources
>>>        governed by that namespace.  The rules for determining whether or
>>>        not a process has a capability in a particular user namespace are
>>>        as follows:
>>>        1. A process has a capability inside a user namespace if it is  a
>>>           member  of  that  namespace  and  it has the capability in its
>>>           effective capability set.  A process can gain capabilities  in
>>>           its effective capability set in various ways.  For example, it
>>>           may execute a set-user-ID program or an executable with  asso‐
>>>           ciated  file  capabilities.   In  addition, a process may gain
>>>           capabilities  via  the  effect  of  clone(2),  unshare(2),  or
>>>           setns(2), as already described.
>>>        2. If a process has a capability in a user namespace, then it has
>>>           that capability in all child (and further removed  descendant)
>>>           namespaces as well.
>>>        3. When  a  user  namespace  is  created,  the kernel records the
>>>           effective user ID of the creating process as being the "owner"
>>>           of the namespace.  A process that resides in the parent of the
>>>           user namespace and whose effective user ID matches  the  owner
>>>           of  the  namespace  has all capabilities in the namespace.  By
>>>           virtue of the previous rule, this means that the  process  has
>>>           all capabilities in all further removed descendant user names‐
>>>           paces as well.
>>>    Interaction of user namespaces and other types of namespaces
>>>        Starting in Linux 3.8, unprivileged  processes  can  create  user
>>>        namespaces,  and mount, PID, IPC, network, and UTS namespaces can
>>>        be created with just the CAP_SYS_ADMIN capability in the caller's
>>>        user namespace.
>>>        If  CLONE_NEWUSER  is specified along with other CLONE_NEW* flags
>>>        in a single clone(2) or unshare(2) call, the  user  namespace  is
>>>        guaranteed  to  be  created first, giving the child (clone(2)) or
>>>        caller (unshare(2)) privileges over the remaining namespaces cre‐
>>>        ated by the call.  Thus, it is possible for an unprivileged call‐
>>>        er to specify this combination of flags.
>>>        When a new IPC, mount, network, PID, or UTS namespace is  created
>>>        via clone(2) or unshare(2), the kernel records the user namespace
>>>        of the creating process against the new namespace.  (This associ‐
>>>        ation  can't  be  changed.)   When a process in the new namespace
>>>        subsequently  performs  privileged  operations  that  operate  on
>>>        global resources isolated by the namespace, the permission checks
>>>        are performed according to the process's capabilities in the user
>>>        namespace that the kernel associated with the new namespace.
>> Restrictions on mount namespaces.
>> - A mount namespace has an owner user namespace.  A mount namespace whose
>>   owner user namespace is different than the owner user namespace of
>>   its parent mount namespace is considered a less privileged mount
>>   namespace.
>> - When creating a less privileged mount namespace, shared mounts are
>>   reduced to slave mounts.  This ensures that mappings performed in less
>>   privileged mount namespaces will not propagate to more privileged
>>   mount namespaces.
>> - Mounts that come as a single unit from a more privileged mount
>>   namespace are locked together and may not be separated in a less
>>   privileged mount namespace.
>> - The mount flags readonly, nodev, nosuid, noexec, and the mount atime
>>   settings, when propagated from a more privileged to a less privileged
>>   mount namespace, become locked and may not be changed in the less
>>   privileged mount namespace.
>> - (As of 3.18-rc1 (in Al Viro's vfs.git#for-next tree today)) A file or
>>   directory that is a mount point in one namespace but is not a mount
>>   point in another namespace may be renamed, unlinked, or rmdired in
>>   the mount namespace in which it is not a mount point if the
>>   ordinary permission checks pass.
>>   Previously, attempting to rmdir, unlink, or rename a file or directory
>>   that was a mount point in another mount namespace would result in
>>   -EBUSY.  This behavior had technical problems of enforcement (nfs)
>>   and resulted in a nice denial-of-service attack against more
>>   privileged users (i.e., preventing individual files from being updated
>>   by bind mounting on top of them).
> I need some help here. What is your intention for the above text?
> Do you mean I should add it pretty much as is under a subheading
> "Restrictions on mount namespaces"?

You have the heading "Interaction of user namespaces and other types of
namespaces", and looking through the man pages you have posted for review,
somewhere under that heading is the best place I could find for this
content.  (Better suggestions are welcome.)

My experience with working with you previously is that you tended to
reword what I had written to make the content more readable, perhaps
straying from the exactly accurate to the practical.

So rather than try to write the perfect deathless end-user prose, I
figured I would accurately write down the weird restrictions that apply
to mount namespaces when you create them in user namespaces.

The text will do fine as a subsection entitled "Restrictions on mount
namespaces" as I have written it.  Or it is fine as a seed for something
better.  But those restrictions are important to document so that people
know what to expect from mount namespaces.
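
As an illustration of what "reduced to slave mounts" looks like from
userspace, here is a minimal Python sketch (not text proposed for the man
page) that extracts the optional propagation fields from lines of
/proc/[pid]/mountinfo.  The function names propagation_tags() and
read_propagation() are made up for this example; the only thing assumed
about the kernel is the documented mountinfo field layout, where a shared
mount carries a "shared:N" tag and a slave mount a "master:N" tag.

```python
from typing import Dict, List


def propagation_tags(mountinfo_line: str) -> List[str]:
    """Return the optional fields of one mountinfo line.

    Typical values are 'shared:N', 'master:N', 'propagate_from:N' and
    'unbindable'.  An empty list means the mount is private.
    """
    fields = mountinfo_line.split()
    # Fields 0-5 are fixed (mount ID, parent ID, major:minor, root,
    # mount point, mount options).  Optional fields follow, terminated
    # by a field that consists of a single "-".
    separator = fields.index("-", 6)
    return fields[6:separator]


def read_propagation(path: str = "/proc/self/mountinfo") -> Dict[str, List[str]]:
    """Map each mount point to its propagation tags.

    Mount points are returned as they appear in the file, so a space in
    a path shows up as the escape sequence \\040.  If several mounts are
    stacked on one mount point, the last line wins.
    """
    result = {}
    with open(path) as f:
        for line in f:
            mount_point = line.split()[4]
            result[mount_point] = propagation_tags(line)
    return result
```

Comparing the output of read_propagation() before and after entering a
mount namespace created in a new user namespace should show mounts that
carried a "shared:N" tag now carrying a "master:N" tag instead, if the
restriction behaves as described above.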

On a related note, one thing that has come up recently (in 3 separate
implementations) is that mount(MS_REMOUNT|...,...) must include all of
the mount flags that need to be preserved.  People creating read-only
bind mounts tend to miss that, and also the locked flags in mount
namespaces.  That issue was flushed out now that the kernel no longer
allows most mount flags to be cleared in mount namespaces.
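
A minimal Python sketch of one way to avoid that mistake: read the flags
currently in effect with statvfs(3) and carry them over into the remount
request.  This is an illustration only, not code from any of the
implementations mentioned above.  The helper names are hypothetical, and
the numeric constants are copied by hand from the Linux headers
(<sys/mount.h> and <bits/statvfs.h>), so they are Linux-specific.

```python
import os

# mount(2) flags, values as defined in the Linux <sys/mount.h>.
MS_RDONLY, MS_NOSUID, MS_NODEV, MS_NOEXEC = 1, 2, 4, 8
MS_REMOUNT, MS_NOATIME, MS_NODIRATIME, MS_BIND = 32, 1024, 2048, 4096
MS_RELATIME = 1 << 21

# statvfs f_flag bit (Linux ST_* value) -> corresponding mount(2) flag.
# Most values coincide, but ST_RELATIME and MS_RELATIME differ.
ST_TO_MS = {
    1: MS_RDONLY,         # ST_RDONLY
    2: MS_NOSUID,         # ST_NOSUID
    4: MS_NODEV,          # ST_NODEV
    8: MS_NOEXEC,         # ST_NOEXEC
    1024: MS_NOATIME,     # ST_NOATIME
    2048: MS_NODIRATIME,  # ST_NODIRATIME
    4096: MS_RELATIME,    # ST_RELATIME
}


def preserved_flags(f_flag: int) -> int:
    """Translate the f_flag bits of a statvfs result into MS_* flags."""
    flags = 0
    for st_bit, ms_bit in ST_TO_MS.items():
        if f_flag & st_bit:
            flags |= ms_bit
    return flags


def remount_flags(f_flag: int, extra: int = 0) -> int:
    """Flags for a bind remount that keeps every flag already set and
    adds the ones in 'extra'."""
    return MS_REMOUNT | MS_BIND | preserved_flags(f_flag) | extra


def readonly_bind_remount_flags(path: str) -> int:
    """Flags to make the existing mount at 'path' read-only without
    dropping nosuid, nodev, noexec or the atime settings."""
    return remount_flags(os.statvfs(path).f_flag, MS_RDONLY)
```

The value returned by readonly_bind_remount_flags() would be passed as the
mountflags argument of mount(2).  Passing only MS_REMOUNT|MS_BIND|MS_RDONLY
instead asks the kernel to clear the other flags, which is the case that
now fails when those flags are locked.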

