For review: user_namespace(7) man page

Sat Aug 30 21:53:11 UTC 2014

"Michael Kerrisk (man-pages)" <mtk.manpages at gmail.com> writes:

> Hello Eric et al.,
>
> For various reasons, my work on the namespaces man pages 
> fell off the table a while back. Nevertheless, the pages have
> been close to completion for a while now, and I recently restarted,
> in an effort to finish them. As you also noted to me f2f, there have
> recently been some small namespace changes that may affect
> the content of the pages. Therefore, I'll take the opportunity to
> send the namespace-related pages out for further (final?) review.
>
> So, here, I start with the user_namespaces(7) page, which is shown 
> in rendered form below, with source attached to this mail. I'll
> send various other pages in follow-on mails.
>
> Review comments/suggestions for improvements / bug fixes welcome.
>
> Cheers,
>
> Michael
>
> ==
>
> NAME
>        user_namespaces - overview of Linux user namespaces
>
> DESCRIPTION
>        For an overview of namespaces, see namespaces(7).
>
>        User   namespaces   isolate   security-related   identifiers  and
>        attributes, in particular, user IDs and group IDs (see
>        credentials(7)), the root directory, keys (see keyctl(2)), and
>        capabilities (see capabilities(7)).  A process's user and group IDs can
>        be different inside and outside a user namespace.  In particular,
>        a process can have a normal unprivileged user ID outside  a  user
>        namespace while at the same time having a user ID of 0 inside the
>        namespace; in other words, the process has  full  privileges  for
>        operations  inside  the  user  namespace, but is unprivileged for
>        operations outside the namespace.
>
>    Nested namespaces, namespace membership
>        User namespaces can be nested;  that  is,  each  user  namespace—
>        except  the  initial  ("root") namespace—has a parent user names‐
>        pace, and can have zero or more child user namespaces.  The  par‐
>        ent user namespace is the user namespace of the process that cre‐
>        ates the user namespace via a call to unshare(2) or clone(2) with
>        the CLONE_NEWUSER flag.
>
>        The kernel imposes (since version 3.11) a limit of 32 nested lev‐
>        els of user namespaces.  Calls to  unshare(2)  or  clone(2)  that
>        would cause this limit to be exceeded fail with the error EUSERS.
>
>        Each  process  is  a  member  of  exactly  one user namespace.  A
>        process created via fork(2) or clone(2) without the CLONE_NEWUSER
>        flag  is  a  member  of the same user namespace as its parent.
>        A
           ^ single-threaded

Because of chroot and other things, multi-threaded processes are not
allowed to join a user namespace.  For the documentation, just saying
single-threaded sounds like enough here.

>        process can join another user namespace with setns(2) if  it  has
>        the CAP_SYS_ADMIN capability in that namespace; upon doing so, it gains a
>        full set of capabilities in that namespace.
>
>        A call to clone(2) or  unshare(2)  with  the  CLONE_NEWUSER  flag
>        makes  the  new  child  process (for clone(2)) or the caller (for
>        unshare(2)) a member of the new user  namespace  created  by  the
>        call.
>
>    Capabilities
>        The child process created by clone(2) with the CLONE_NEWUSER flag
>        starts out with a complete set of capabilities in  the  new  user
>        namespace.  Likewise, a process that creates a new user namespace
>        using unshare(2)  or  joins  an  existing  user  namespace  using
>        setns(2)  gains a full set of capabilities in that namespace.  On
>        the other hand, that process has no capabilities  in  the  parent
>        (in  the case of clone(2)) or previous (in the case of unshare(2)
>        and setns(2)) user namespace, even if the new namespace  is  cre‐
>        ated  or  joined by the root user (i.e., a process with user ID 0
>        in the root namespace).
>
>        Note that a call to execve(2) will cause a process  to  lose  any
>        capabilities that it has, unless it has a user ID of 0 within the
>        namespace.  See the discussion of user  and  group  ID  mappings,
>        below.
>
>        A   call   to   clone(2),   unshare(2),  or  setns(2)  using  the
>        CLONE_NEWUSER flag sets the  "securebits"  flags  (see  capabili‐
>        ties(7))  to  their  default  values  (all flags disabled) in the
>        child (for clone(2)) or caller  (for  unshare(2),  or  setns(2)).
>        Note  that  because  the caller no longer has capabilities in its
>        original user namespace after a call to setns(2), it is not  pos‐
>        sible for a process to reset its "securebits" flags while retain‐
>        ing its user namespace membership by using  a  pair  of  setns(2)
>        calls  to  move  to another user namespace and then return to its
>        original user namespace.
>
>        Having a capability inside a user namespace permits a process  to
>        perform  operations  (that  require  privilege) only on resources
>        governed by that namespace.  The rules for determining whether or
>        not a process has a capability in a particular user namespace are
>        as follows:
>
>        1. A process has a capability inside a user namespace if it is  a
>           member  of  that  namespace  and  it has the capability in its
>           effective capability set.  A process can gain capabilities  in
>           its effective capability set in various ways.  For example, it
>           may execute a set-user-ID program or an executable with  asso‐
>           ciated  file  capabilities.   In  addition, a process may gain
>           capabilities  via  the  effect  of  clone(2),  unshare(2),  or
>           setns(2), as already described.
>
>        2. If a process has a capability in a user namespace, then it has
>           that capability in all child (and further removed  descendant)
>           namespaces as well.
>
>        3. When  a  user  namespace  is  created,  the kernel records the
>           effective user ID of the creating process as being the "owner"
>           of the namespace.  A process that resides in the parent of the
>           user namespace and whose effective user ID matches  the  owner
>           of  the  namespace  has all capabilities in the namespace.  By
>           virtue of the previous rule, this means that the  process  has
>           all capabilities in all further removed descendant user names‐
>           paces as well.
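
To make the three rules concrete, here is a small user-space model in C.
It is only a sketch: struct userns, struct proc and has_cap_in_ns() are
names invented for this illustration, not kernel interfaces, and the
owner is assumed to be recorded as a UID in the parent namespace.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/types.h>

/* Hypothetical model types for illustration; not kernel structures */

struct userns {
    const struct userns *parent;   /* NULL for the initial namespace */
    uid_t owner;                   /* Effective UID of the creator (assumed
                                      to be expressed in the parent) */
};

struct proc {
    const struct userns *ns;       /* Namespace the process is a member of */
    uid_t euid;                    /* Effective UID of the process */
    uint64_t effective;            /* Bit mask of effective capabilities */
};

/* Return true if 'p' has capability number 'cap' in namespace 'target' */

static bool
has_cap_in_ns(const struct proc *p, const struct userns *target, int cap)
{
    const struct userns *ns;

    assert(cap >= 0 && cap < 64);

    /* Walk from the target namespace up towards the initial namespace */

    for (ns = target; ns != NULL; ns = ns->parent) {

        /* Rules 1 and 2: the process is a member of the target or of
           one of its ancestors, so its effective set decides */

        if (ns == p->ns)
            return (p->effective >> cap) & 1;

        /* Rule 3: the process lives in the parent of 'ns' and its
           effective UID matches the owner of 'ns' */

        if (ns->parent == p->ns && ns->owner == p->euid)
            return true;
    }

    return false;       /* 'target' is not p's namespace or a descendant */
}
```

With a child namespace owned by UID 1000, a process with effective UID
1000 in the parent passes the check for the child and for any namespace
below it, while a check against the parent itself depends only on the
process's own effective set.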
>
>    Interaction of user namespaces and other types of namespaces
>        Starting in Linux 3.8, unprivileged  processes  can  create  user
>        namespaces,  and mount, PID, IPC, network, and UTS namespaces can
>        be created with just the CAP_SYS_ADMIN capability in the caller's
>        user namespace.
>
>        If  CLONE_NEWUSER  is specified along with other CLONE_NEW* flags
>        in a single clone(2) or unshare(2) call, the  user  namespace  is
>        guaranteed  to  be  created first, giving the child (clone(2)) or
>        caller (unshare(2)) privileges over the remaining namespaces cre‐
>        ated by the call.  Thus, it is possible for an unprivileged call‐
>        er to specify this combination of flags.
>
>        When a new IPC, mount, network, PID, or UTS namespace is  created
>        via clone(2) or unshare(2), the kernel records the user namespace
>        of the creating process against the new namespace.  (This associ‐
>        ation  can't  be  changed.)   When a process in the new namespace
>        subsequently  performs  privileged  operations  that  operate  on
>        global resources isolated by the namespace, the permission checks
>        are performed according to the process's capabilities in the user
>        namespace that the kernel associated with the new namespace.

Restrictions on mount namespaces.

- A mount namespace has an owner user namespace.  A mount namespace whose
  owner user namespace is different from the owner user namespace of
  its parent mount namespace is considered a less privileged mount
  namespace.

- When creating a less privileged mount namespace, shared mounts are
  reduced to slave mounts.  This ensures that mappings performed in less
  privileged mount namespaces will not propagate to more privileged
  mount namespaces.

- Mounts that come as a single unit from a more privileged mount
  namespace are locked together and may not be separated in a less
  privileged mount namespace.

- The mount flags readonly, nodev, nosuid, noexec, and the mount atime
  settings, when propagated from a more privileged to a less privileged
  mount namespace, become locked and may not be changed in the less
  privileged mount namespace.

- (As of 3.18-rc1 (in today's vfs.git#for-next tree from Al Viro)) A file
  or directory that is a mount point in one namespace but is not a mount
  point in another namespace may be renamed, unlinked, or rmdir'ed in
  the mount namespace in which it is not a mount point, if the
  ordinary permission checks pass.

  Previously, attempting to rmdir, unlink, or rename a file or directory
  that was a mount point in another mount namespace would result in
  -EBUSY.  This behavior had technical problems of enforcement (nfs)
  and resulted in a nice denial-of-service attack against more
  privileged users (i.e., preventing individual files from being updated
  by bind mounting on top of them).

>    User and group ID mappings: uid_map and gid_map
>        When a user namespace is created, it starts out without a mapping
>        of user IDs (group  IDs)  to  the  parent  user  namespace.   The
>        /proc/[pid]/uid_map   and  /proc/[pid]/gid_map  files  (available
>        since Linux 3.5) expose the  mappings  for  user  and  group  IDs
>        inside  the  user namespace for the process pid.  These files can
>        be read to view the mappings in a user namespace and  written  to
>        (once) to define the mappings.
>
>        The  description in the following paragraphs explains the details
>        for uid_map; gid_map is exactly the same, but  each  instance  of
>        "user ID" is replaced by "group ID".
>
>        The  uid_map  file  exposes the mapping of user IDs from the user
>        namespace of the process pid to the user namespace of the process
>        that  opened  uid_map  (but  see  a  qualification  to this point
>        below).  In other words, processes that  are  in  different  user
>        namespaces  will  potentially  see  different values when reading
>        from a particular uid_map file, depending on the user ID mappings
>        for the user namespaces of the reading processes.
>
>        Each  line  in  the  uid_map file specifies a 1-to-1 mapping of a
>        range of contiguous user IDs between two user namespaces.   (When
>        a  user  namespace  is  first  created, this file is empty.)  The
>        specification in each line takes the form of three numbers delim‐
>        ited  by white space.  The first two numbers specify the starting
>        user ID in each of the two user  namespaces.   The  third  number
>        specifies  the length of the mapped range.  In detail, the fields
>        are interpreted as follows:
>
>        (1) The start of the range of user IDs in the user  namespace  of
>            the process pid.
>
>        (2) The  start  of  the  range  of user IDs to which the user IDs
>            specified by field one map.  How  field  two  is  interpreted
>            depends  on  whether  the process that opened uid_map and the
>            process pid are in the same user namespace, as follows:
>
>            a) If the two processes are  in  different  user  namespaces:
>               field  two is the start of a range of user IDs in the user
>               namespace of the process that opened uid_map.
>
>            b) If the two processes are in the same user namespace: field
>               two  is  the  start of the range of user IDs in the parent
>               user namespace of the process pid.  This case enables  the
>               opener  of  uid_map  (the  common  case  here  is  opening
>               /proc/self/uid_map) to see the mapping of  user  IDs  into
>               the  user  namespace of the process that created this user
>               namespace.
>
>        (3) The length of the range of user IDs that  is  mapped  between
>            the two user namespaces.
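
A minimal sketch of how the three fields combine when an ID is
translated, assuming the map has already been parsed into an array.
struct id_extent and map_id_down() are illustrative names chosen for
this mail, not an existing API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* One uid_map line, holding fields (1), (2) and (3) as described above */

struct id_extent {
    uint32_t first;         /* (1) Start of range inside the namespace */
    uint32_t lower_first;   /* (2) Start of range in the other namespace */
    uint32_t count;         /* (3) Length of the range */
};

/* Translate 'id' (an ID inside the namespace) through the 'n' lines in
   'map'. On success, store the result in '*out' and return true;
   return false if no line covers 'id' (the ID is unmapped). */

static bool
map_id_down(const struct id_extent *map, size_t n, uint32_t id,
            uint32_t *out)
{
    assert(out != NULL);

    for (size_t i = 0; i < n; i++) {

        /* Comparing the offset against 'count' avoids computing
           first + count, which could overflow 32 bits */

        if (id >= map[i].first && id - map[i].first < map[i].count) {
            *out = map[i].lower_first + (id - map[i].first);
            return true;
        }
    }

    return false;
}
```

For a map consisting of the lines "0 1000 1" and "1 100000 65535", ID 0
translates to 1000 and ID 1 translates to 100000, while ID 65536 has no
mapping.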
>
>        System  calls  that  return  user  IDs  (group  IDs)—for example,
>        getuid(2), getgid(2), and the credential fields in the  structure
>        returned by stat(2)—return the user ID (group ID) mapped into the
>        caller's user namespace.
>
>        When a process accesses a file, its user and group IDs are mapped
>        into  the  initial  user  namespace for the purpose of permission
>        checking and assigning IDs when creating a file.  When a  process
>        retrieves file user and group IDs via stat(2), the IDs are mapped
>        in the opposite direction, to  produce  values  relative  to  the
>        process user and group ID mappings.
>
>        The initial user namespace has no parent namespace, but, for con‐
>        sistency, the kernel provides dummy user  and  group  ID  mapping
>        files  for  this namespace.  Looking at the uid_map file (gid_map
>        is the same) from a shell in the initial namespace shows:
>
>            $ cat /proc/$$/uid_map
>                     0          0 4294967295
>
>        This mapping tells us that the range starting at  user  ID  0  in
>        this namespace maps to a range starting at 0 in the (nonexistent)
>        parent namespace, and the length of  the  range  is  the  largest
>        32-bit unsigned integer.

This deliberately leaves 4294967295 (the 32-bit value of -1) unmapped.
(uid_t) -1 is used in several interfaces (like setreuid) as a way to
specify "no uid"; leaving it unmapped and unusable guarantees that there
will be no confusion when using those kernel interfaces.
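
To illustrate, this sketch parses the line shown above and computes the
last ID that the range covers.  parse_map_line() and last_mapped_id()
are hypothetical helpers written for this mail, not part of any library.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Parse one uid_map/gid_map line into its three fields. Returns true
   on success, false if the line does not hold three 32-bit numbers. */

static bool
parse_map_line(const char *line, uint32_t *inside, uint32_t *outside,
               uint32_t *len)
{
    unsigned long long a, b, c;

    assert(line != NULL);

    if (sscanf(line, "%llu %llu %llu", &a, &b, &c) != 3)
        return false;
    if (a > UINT32_MAX || b > UINT32_MAX || c > UINT32_MAX)
        return false;

    *inside = (uint32_t) a;
    *outside = (uint32_t) b;
    *len = (uint32_t) c;
    return true;
}

/* Last ID covered by a range; computed in 64 bits so that the sum
   cannot wrap around */

static uint64_t
last_mapped_id(uint32_t start, uint32_t len)
{
    assert(len > 0);
    return (uint64_t) start + len - 1;
}
```

For the line "0 0 4294967295" the last covered ID is 4294967294, so
4294967295 stays outside the range.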

>    Defining user and group ID mappings: writing to uid_map and gid_map
>        After  the  creation of a new user namespace, the uid_map file of
>        one of the processes in the namespace may be written to  once  to
>        define  the  mapping  of  user IDs in the new user namespace.  An
>        attempt to write more than once to  a  uid_map  file  in  a  user
>        namespace  fails  with  the error EPERM.  Similar rules apply for
>        gid_map files.
>
>        The lines written to uid_map (gid_map) must conform to  the  fol‐
>        lowing rules:
>
>        *  The  three  fields  must  be valid numbers, and the last field
>           must be greater than 0.
>
>        *  Lines are terminated by newline characters.
>
>        *  There is an (arbitrary) limit on the number of  lines  in  the
>           file.  As at Linux 3.8, the limit is five lines.  In addition,
>           the number of bytes written to the file must be less than  the
>           system page size, and the write must be performed at the start
>           of the file (i.e., lseek(2) and pwrite(2)  can't  be  used  to
>           write to nonzero offsets in the file).
>
>        *  The  range of user IDs (group IDs) specified in each line can‐
>           not overlap with the ranges in any other lines.  In  the  ini‐
>           tial  implementation  (Linux 3.8), this requirement was satis‐
>           fied by a simplistic implementation that imposed  the  further
>           requirement  that  the  values  in both field 1 and field 2 of
>           successive lines must be in ascending numerical  order,  which
>           prevented some otherwise valid maps from being created.  Linux
>           3.9 and later fix this limitation, allowing any valid  set  of
>           nonoverlapping maps.
>
>        *  At least one line must be written to the file.
>
>        Writes that violate the above rules fail with the error EINVAL.
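
The format rules above could be checked in user space before writing to
the file, roughly as follows.  This is a sketch under assumptions:
check_map() is a hypothetical helper, it is stricter than necessary
about white space at the end of a line, it hard-codes the five-line
limit quoted for Linux 3.8, it applies the overlap test to both the
inside and the outside ranges, and it does not check the page-size or
file-offset rules.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define MAX_MAP_LINES 5         /* Limit quoted above for Linux 3.8 */

struct map_line {
    uint32_t inside, outside, len;
};

/* True if [a, a+alen) and [b, b+blen) share at least one ID */

static bool
ranges_overlap(uint32_t a, uint32_t alen, uint32_t b, uint32_t blen)
{
    uint64_t a_end = (uint64_t) a + alen;       /* One past the end */
    uint64_t b_end = (uint64_t) b + blen;

    return a < b_end && b < a_end;
}

/* Return the number of lines in 'text' if it satisfies the format
   rules, or -1 if any rule is violated */

static int
check_map(const char *text)
{
    struct map_line m[MAX_MAP_LINES];
    int n = 0;

    assert(text != NULL);

    while (*text != '\0') {
        unsigned long long a, b, c;
        int used = 0;
        const char *nl = strchr(text, '\n');

        /* Each line must end in a newline; at most MAX_MAP_LINES lines */
        if (nl == NULL || n == MAX_MAP_LINES)
            return -1;

        /* Exactly three numbers, ending right at the newline */
        if (sscanf(text, "%llu %llu %llu%n", &a, &b, &c, &used) != 3 ||
                text + used != nl)
            return -1;

        /* Valid 32-bit values; the length must be greater than 0 */
        if (a > UINT32_MAX || b > UINT32_MAX || c == 0 || c > UINT32_MAX)
            return -1;

        /* No overlap with any earlier line, on either side of the map */
        for (int i = 0; i < n; i++)
            if (ranges_overlap(m[i].inside, m[i].len,
                               (uint32_t) a, (uint32_t) c) ||
                    ranges_overlap(m[i].outside, m[i].len,
                                   (uint32_t) b, (uint32_t) c))
                return -1;

        m[n].inside = (uint32_t) a;
        m[n].outside = (uint32_t) b;
        m[n].len = (uint32_t) c;
        n++;
        text = nl + 1;
    }

    return n > 0 ? n : -1;      /* At least one line is required */
}
```

A caller might run check_map() on the string it is about to write and
refuse to proceed on -1, instead of waiting for the write to fail.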
>
>        In  order  for  a  process  to  write  to the /proc/[pid]/uid_map
>        (/proc/[pid]/gid_map) file, all  of  the  following  requirements
>        must be met:
>
>        1. The  writing  process  must  have  the CAP_SETUID (CAP_SETGID)
>           capability in the user namespace of the process pid.
>
>        2. The writing process must be in either the  user  namespace  of
>           the  process  pid  or  inside the parent user namespace of the
>           process pid.
>
>        3. The mapped user IDs (group IDs) must in turn have a mapping in
>           the parent user namespace.
>
>        4. One of the following is true:
>
>           *  The  data written to uid_map (gid_map) consists of a single
>              line that maps the writing  process's  filesystem  user  ID
>              (group ID) in the parent user namespace to a user ID (group
>              ID) in the user namespace.  The usual  case  here  is  that
>              this single line provides a mapping for the user ID of the
>              process that created the namespace.
>
>           *  The process has the CAP_SETUID (CAP_SETGID)  capability  in
>              the  parent user namespace.  Thus, a privileged process can
>              make mappings to arbitrary user IDs (group IDs) in the par‐
>              ent user namespace.
>
>        Writes that violate the above rules fail with the error EPERM.
>
>    Unmapped user and group IDs
>        There are various places where an unmapped user ID (group ID) may
>        be exposed to user space.  For example, the first  process  in  a
>        new user namespace may call getuid() before a user ID mapping has
>        been defined for the namespace.  In most such cases, an  unmapped
>        user  ID  is  converted  to  the overflow user ID (group ID); the
>        default value for the overflow user ID (group ID) is 65534.   See
>        the     descriptions    of    /proc/sys/kernel/overflowuid    and
>        /proc/sys/kernel/overflowgid in proc(5).
>
>        The cases where unmapped IDs are mapped in this  fashion  include
>        system calls that return user IDs (getuid(2), getgid(2), and
>        similar), credentials passed over a UNIX domain socket, credentials
>        returned  by  stat(2),  waitid(2),  and  the  System  V IPC "ctl"
>        IPC_STAT operations, credentials exposed by /proc/[pid]/status and
>        the files in /proc/sysvipc/*, credentials returned via the si_uid
>        field in the siginfo_t received with a signal (see sigaction(2)),
>        credentials written to the process accounting file (see acct(5)),
>        and credentials returned with POSIX message  queue  notifications
>        (see mq_notify(3)).
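
As a sketch of that conversion, assuming a parsed map and the default
overflow value of 65534 (the struct and function names are made up for
the example; a real system would take the value from
/proc/sys/kernel/overflowuid):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define OVERFLOW_ID 65534u      /* Default overflow ID stated above */

struct id_extent {
    uint32_t inside;            /* Start of range inside the namespace */
    uint32_t outside;           /* Start of range outside the namespace */
    uint32_t len;               /* Length of the range */
};

/* Translate an ID from the outer namespace into the namespace described
   by 'map'. IDs without a mapping are reported as the overflow ID. */

static uint32_t
id_as_seen_inside(const struct id_extent *map, size_t n, uint32_t outer_id)
{
    assert(map != NULL || n == 0);

    for (size_t i = 0; i < n; i++)
        if (outer_id >= map[i].outside &&
                outer_id - map[i].outside < map[i].len)
            return map[i].inside + (outer_id - map[i].outside);

    return OVERFLOW_ID;
}
```

With an empty map, which models a namespace whose uid_map has not been
written yet, every ID is reported as 65534.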
>
>        There  is  one notable case where unmapped user and group IDs are
>        not converted to the corresponding overflow ID value.  When view‐
>        ing  a  uid_map  or gid_map file in which there is no mapping for
>        the second field, that field is displayed as 4294967295 (-1 as an
>        unsigned integer).
>
>    Set-user-ID and set-group-ID programs
>        When  a  process  inside  a user namespace executes a set-user-ID
>        (set-group-ID) program, the process's effective user  (group)  ID
>        inside  the  namespace is changed to whatever value is mapped for
>        the user (group) ID of the file.  However, if either the user  or
>        the group ID of the file has no mapping inside the namespace, the
>        set-user-ID (set-group-ID) bit is silently ignored: the new  pro‐
>        gram  is executed, but the process's effective user (group) ID is
>        left unchanged.  (This mirrors the semantics of executing a  set-
>        user-ID or set-group-ID program that resides on a filesystem that
>        was mounted with the MS_NOSUID flag, as described in mount(2).)
>
>    Miscellaneous
>        When a process's user and group IDs are passed over a UNIX domain
>        socket  to  a  process  in  a  different  user namespace (see the
>        description of SCM_CREDENTIALS in unix(7)), they  are  translated
>        into the corresponding values as per the receiving process's user
>        and group ID mappings.
>
> CONFORMING TO
>        Namespaces are a Linux-specific feature.
>
> NOTES
>        Over the years, there have been a lot of features that have  been
>        added  to  the Linux kernel that have been made available only to
>        privileged users because of their potential to confuse  set-user-
>        ID-root  applications.   In general, it becomes safe to allow the
>        root user in a user namespace to use those features because it is
>        impossible,  while  in  a  user namespace, to gain more privilege
>        than the root user of a user namespace has.
>
>    Availability
>        Use of user namespaces requires a kernel that is configured  with
>        the  CONFIG_USER_NS option.  User namespaces require support in a
>        range of subsystems across the kernel.  When an unsupported  sub‐
>        system  is configured into the kernel, it is not possible to con‐
>        figure user namespaces support.
>
>        As at Linux 3.8, most relevant subsystems supported  user  names‐
>        paces,  but  a number of filesystems did not have the infrastruc‐
>        ture needed to map user and group IDs  between  user  namespaces.
>        Linux  3.9  added the required infrastructure support for many of
>        the remaining unsupported filesystems (Plan 9 (9P),  Andrew  File
>        System  (AFS),  Ceph,  CIFS,  CODA,  NFS, and OCFS2).  Linux 3.11
>        added support for the last of the unsupported major filesystems, XFS.
>
> EXAMPLE
>        The program below is designed to allow  experimenting  with  user
>        namespaces,  as  well  as  other types of namespaces.  It creates
>        namespaces as specified by command-line options and then executes
>        a  command  inside  those  namespaces.   The comments and usage()
>        function inside the program provide a  full  explanation  of  the
>        program.  The following shell session demonstrates its use.
>
>        First, we look at the run-time environment:
>
>            $ uname -rs     # Need Linux 3.8 or later
>            Linux 3.8.0
>            $ id -u         # Running as unprivileged user
>            1000
>            $ id -g
>            1000
>
>        Now  start a new shell in new user (-U), mount (-m), and PID (-p)
>        namespaces, with user ID (-M) and group ID (-G) 1000 mapped to  0
>        inside the user namespace:
>
>            $ ./userns_child_exec -p -m -U -M '0 1000 1' -G '0 1000 1' bash
>
>        The  shell  has PID 1, because it is the first process in the new
>        PID namespace:
>
>            bash$ echo $$
>            1
>
>        Inside the user namespace, the shell has user and group ID 0, and
>        a full set of permitted and effective capabilities:
>
>            bash$ cat /proc/$$/status | egrep '^[UG]id'
>            Uid: 0    0    0    0
>            Gid: 0    0    0    0
>            bash$ cat /proc/$$/status | egrep '^Cap(Prm|Inh|Eff)'
>            CapInh:   0000000000000000
>            CapPrm:   0000001fffffffff
>            CapEff:   0000001fffffffff
>
>        Mounting  a new /proc filesystem and listing all of the processes
>        visible in the new PID namespace shows that the shell  can't  see
>        any processes outside the PID namespace:
>
>            bash$ mount -t proc proc /proc
>            bash$ ps ax
>              PID TTY      STAT   TIME COMMAND
>                1 pts/3    S      0:00 bash
>               22 pts/3    R+     0:00 ps ax
>
>    Program source
>
>        /* userns_child_exec.c
>
>           Licensed under GNU General Public License v2 or later
>
>           Create a child process that executes a shell command in new
>           namespace(s); allow UID and GID mappings to be specified when
>           creating a user namespace.
>        */
>        #define _GNU_SOURCE
>        #include <sched.h>
>        #include <unistd.h>
>        #include <stdlib.h>
>        #include <sys/wait.h>
>        #include <signal.h>
>        #include <fcntl.h>
>        #include <stdio.h>
>        #include <string.h>
>        #include <limits.h>
>        #include <errno.h>
>
>        /* A simple error-handling function: print an error message based
>           on the value in 'errno' and terminate the calling process */
>
>        #define errExit(msg)    do { perror(msg); exit(EXIT_FAILURE); \
>                                } while (0)
>
>        struct child_args {
>            char **argv;        /* Command to be executed by child, with args */
>            int    pipe_fd[2];  /* Pipe used to synchronize parent and child */
>        };
>
>        static int verbose;
>
>        static void
>        usage(char *pname)
>        {
>            fprintf(stderr, "Usage: %s [options] cmd [arg...]\n\n", pname);
>            fprintf(stderr, "Create a child process that executes a shell "
>                    "command in a new user namespace,\n"
>                    "and possibly also other new namespace(s).\n\n");
>            fprintf(stderr, "Options can be:\n\n");
>        #define fpe(str) fprintf(stderr, "    %s", str);
>            fpe("-i          New IPC namespace\n");
>            fpe("-m          New mount namespace\n");
>            fpe("-n          New network namespace\n");
>            fpe("-p          New PID namespace\n");
>            fpe("-u          New UTS namespace\n");
>            fpe("-U          New user namespace\n");
>            fpe("-M uid_map  Specify UID map for user namespace\n");
>            fpe("-G gid_map  Specify GID map for user namespace\n");
>            fpe("-z          Map user's UID and GID to 0 in user namespace\n");
>            fpe("            (equivalent to: -M '0 <uid> 1' -G '0 <gid> 1')\n");
>            fpe("-v          Display verbose messages\n");
>            fpe("\n");
>            fpe("If -z, -M, or -G is specified, -U is required.\n");
>            fpe("It is not permitted to specify both -z and either -M or -G.\n");
>            fpe("\n");
>            fpe("Map strings for -M and -G consist of records of the form:\n");
>            fpe("\n");
>            fpe("    ID-inside-ns   ID-outside-ns   len\n");
>            fpe("\n");
>            fpe("A map string can contain multiple records, separated"
>                " by commas;\n");
>            fpe("the commas are replaced by newlines before writing"
>                " to map files.\n");
>
>            exit(EXIT_FAILURE);
>        }
>
>        /* Update the mapping file 'map_file', with the value provided in
>           'mapping', a string that defines a UID or GID mapping. A UID or
>           GID mapping consists of one or more newline-delimited records
>           of the form:
>
>               ID-inside-ns    ID-outside-ns   length
>
>           Requiring the user to supply a string that contains newlines is
>           of course inconvenient for command-line use. Thus, we permit the
>           use of commas to delimit records in this string, and replace them
>           with newlines before writing the string to the file. */
>
>        static void
>        update_map(char *mapping, char *map_file)
>        {
>            int fd, j;
>            size_t map_len;     /* Length of 'mapping' */
>
>            /* Replace commas in mapping string with newlines */
>
>            map_len = strlen(mapping);
>            for (j = 0; j < map_len; j++)
>                if (mapping[j] == ',')
>                    mapping[j] = '\n';
>
>            fd = open(map_file, O_RDWR);
>            if (fd == -1) {
>                fprintf(stderr, "ERROR: open %s: %s\n", map_file,
>                        strerror(errno));
>                exit(EXIT_FAILURE);
>            }
>
>            if (write(fd, mapping, map_len) != map_len) {
>                fprintf(stderr, "ERROR: write %s: %s\n", map_file,
>                        strerror(errno));
>                exit(EXIT_FAILURE);
>            }
>
>            close(fd);
>        }
>
>        static int              /* Start function for cloned child */
>        childFunc(void *arg)
>        {
>            struct child_args *args = (struct child_args *) arg;
>            char ch;
>
>            /* Wait until the parent has updated the UID and GID mappings.
>               See the comment in main(). We wait for end of file on a
>               pipe that will be closed by the parent process once it has
>               updated the mappings. */
>
>            close(args->pipe_fd[1]);    /* Close our descriptor for the write
>                                           end of the pipe so that we see EOF
>                                           when parent closes its descriptor */
>            if (read(args->pipe_fd[0], &ch, 1) != 0) {
>                fprintf(stderr,
>                        "Failure in child: read from pipe returned != 0\n");
>                exit(EXIT_FAILURE);
>            }
>
>            /* Execute a shell command */
>
>            printf("About to exec %s\n", args->argv[0]);
>            execvp(args->argv[0], args->argv);
>            errExit("execvp");
>        }
>
>        #define STACK_SIZE (1024 * 1024)
>
>        static char child_stack[STACK_SIZE];    /* Space for child's stack */
>
>        int
>        main(int argc, char *argv[])
>        {
>            int flags, opt, map_zero;
>            pid_t child_pid;
>            struct child_args args;
>            char *uid_map, *gid_map;
>            const int MAP_BUF_SIZE = 100;
>            char map_buf[MAP_BUF_SIZE];
>            char map_path[PATH_MAX];
>
>            /* Parse command-line options. The initial '+' character in
>               the final getopt() argument prevents GNU-style permutation
>               of command-line options. That's useful, since sometimes
>               the 'command' to be executed by this program itself
>               has command-line options. We don't want getopt() to treat
>               those as options to this program. */
>
>            flags = 0;
>            verbose = 0;
>            gid_map = NULL;
>            uid_map = NULL;
>            map_zero = 0;
>            while ((opt = getopt(argc, argv, "+imnpuUM:G:zv")) != -1) {
>                switch (opt) {
>                case 'i': flags |= CLONE_NEWIPC;        break;
>                case 'm': flags |= CLONE_NEWNS;         break;
>                case 'n': flags |= CLONE_NEWNET;        break;
>                case 'p': flags |= CLONE_NEWPID;        break;
>                case 'u': flags |= CLONE_NEWUTS;        break;
>                case 'v': verbose = 1;                  break;
>                case 'z': map_zero = 1;                 break;
>                case 'M': uid_map = optarg;             break;
>                case 'G': gid_map = optarg;             break;
>                case 'U': flags |= CLONE_NEWUSER;       break;
>                default:  usage(argv[0]);
>                }
>            }
>
>            /* -M, -G, or -z without -U is nonsensical, as is -z with -M or -G */
>
>            if (((uid_map != NULL || gid_map != NULL || map_zero) &&
>                        !(flags & CLONE_NEWUSER)) ||
>                    (map_zero && (uid_map != NULL || gid_map != NULL)))
>                usage(argv[0]);
>
>            args.argv = &argv[optind];
>
>            /* We use a pipe to synchronize the parent and child, in order to
>               ensure that the parent sets the UID and GID maps before the child
>               calls execve(). This ensures that the child maintains its
>               capabilities during the execve() in the common case where we
>               want to map the child's effective user ID to 0 in the new user
>               namespace. Without this synchronization, the child would lose
>               its capabilities if it performed an execve() with nonzero
>               user IDs (see the capabilities(7) man page for details of the
>               transformation of a process's capabilities during execve()). */
>
>            if (pipe(args.pipe_fd) == -1)
>                errExit("pipe");
>
>            /* Create the child in new namespace(s) */
>
>            child_pid = clone(childFunc, child_stack + STACK_SIZE,
>                              flags | SIGCHLD, &args);
>            if (child_pid == -1)
>                errExit("clone");
>
>            /* Parent falls through to here */
>
>            if (verbose)
>                printf("%s: PID of child created by clone() is %ld\n",
>                        argv[0], (long) child_pid);
>
>            /* Update the UID and GID maps in the child */
>
>            if (uid_map != NULL || map_zero) {
>                snprintf(map_path, PATH_MAX, "/proc/%ld/uid_map",
>                        (long) child_pid);
>                if (map_zero) {
>                    snprintf(map_buf, MAP_BUF_SIZE, "0 %ld 1", (long) getuid());
>                    uid_map = map_buf;
>                }
>                update_map(uid_map, map_path);
>            }
>            if (gid_map != NULL || map_zero) {
>                snprintf(map_path, PATH_MAX, "/proc/%ld/gid_map",
>                        (long) child_pid);
>                if (map_zero) {
>                    snprintf(map_buf, MAP_BUF_SIZE, "0 %ld 1", (long) getgid());
>                    gid_map = map_buf;
>                }
>                update_map(gid_map, map_path);
>            }
>
>            /* Close the write end of the pipe, to signal to the child that we
>               have updated the UID and GID maps */
>
>            close(args.pipe_fd[1]);
>
>            if (waitpid(child_pid, NULL, 0) == -1)      /* Wait for child */
>                errExit("waitpid");
>
>            if (verbose)
>                printf("%s: terminating\n", argv[0]);
>
>            exit(EXIT_SUCCESS);
>        }
>
> SEE ALSO
>        newgidmap(1),   newuidmap(1),   clone(2),  setns(2),  unshare(2),
>        proc(5), subgid(5), subuid(5),  credentials(7),  capabilities(7),
>        namespaces(7), pid_namespaces(7)
>
>        The  kernel  source  file  Documentation/namespaces/resource-con‐
>        trol.txt.

Eric