[PATCH v15 0/9] open: introduce openat2(2) syscall

> This patchset is being developed here:
>   <https://github.com/cyphar/linux/tree/openat2/master>
> Patch changelog:
>  v15:
>   * Fix code style for LOOKUP_IN_ROOT handling in path_init(). [Linus Torvalds]
>   * Split out patches for each individual LOOKUP flag.
>   * Reword commit messages to give more background information about the
>     series, as well as mention the semantics of each flag in more detail.
> For a very long time, extending openat(2) with new features has been
> incredibly frustrating. This stems from the fact that openat(2) is
> possibly the most famous counter-example to the mantra "don't silently
> accept garbage from userspace" -- it doesn't check whether unknown flags
> are present[1].
> This means that (generally) the addition of new flags to openat(2) has
> been fraught with backwards-compatibility issues (O_TMPFILE has to be
> defined as __O_TMPFILE|O_DIRECTORY|[O_RDWR or O_WRONLY] to ensure old
> kernels gave errors, since it's insecure to silently ignore the
> flag[2]). All new security-related flags therefore have a tough road to
> being added to openat(2).
> Furthermore, the need for some sort of control over VFS's path resolution (to
> avoid malicious paths resulting in inadvertent breakouts) has been a very
> long-standing desire of many userspace applications. This patchset is a revival
> of Al Viro's old AT_NO_JUMPS[3] patchset (which was a variant of David
> Drysdale's O_BENEATH patchset[4] which was a spin-off of the Capsicum
> project[5]) with a few additions and changes made based on the previous
> discussion within [6] as well as others I felt were useful.
> In line with the conclusions of the original discussion of AT_NO_JUMPS, the
> flag has been split up into separate flags. However, instead of being an
> openat(2) flag it is provided through a new syscall openat2(2) which provides
> several other improvements to the openat(2) interface (see the patch
> description for more details). The following new LOOKUP_* flags are added:
>   * LOOKUP_NO_XDEV blocks all mountpoint crossings (upwards, downwards,
>     or through absolute links). Absolute pathnames alone in openat(2) do not
>     trigger this. Magic-link traversal which implies a vfsmount jump is also
>     blocked (though magic-link jumps on the same vfsmount are permitted).
>   * LOOKUP_NO_MAGICLINKS blocks resolution through /proc/$pid/fd-style
>     links. This is done by blocking the usage of nd_jump_link() during
>     resolution in a filesystem. The term "magic-links" is used to match
>     with the only reference to these links in Documentation/, but I'm
>     happy to change the name.
>     It should be noted that this is different to the scope of
>     ~LOOKUP_FOLLOW in that it applies to all path components. However,
>     you can do openat2(NO_FOLLOW|NO_MAGICLINKS) on a magic-link and it
>     will *not* fail (assuming that no parent component was a
>     magic-link), and you will have an fd for the magic-link.
>     In order to correctly detect magic-links, the introduction of a new
>     LOOKUP_MAGICLINK_JUMPED state flag was required.
>   * LOOKUP_BENEATH disallows escapes to outside the starting dirfd's
>     tree, using techniques such as ".." or absolute links. Absolute
>     paths in openat(2) are also disallowed. Conceptually this flag is to
>     ensure you "stay below" a certain point in the filesystem tree --
>     but this requires some additional work to protect against various races
>     that would allow escape using "..".
>     Currently LOOKUP_BENEATH implies LOOKUP_NO_MAGICLINKS, because
>     magic-links can trivially beam you around the filesystem (breaking the
>     protection). In future, there might be similar safety checks done as
>     in LOOKUP_IN_ROOT, but that requires more discussion.
> In addition, two new flags are added that expand on the above ideas:
>   * LOOKUP_NO_SYMLINKS does what it says on the tin. No symlink
>     resolution is allowed at all, including magic-links. Just as with
>     LOOKUP_NO_MAGICLINKS this can still be used with NOFOLLOW to open an
>     fd for the symlink as long as no parent path had a symlink
>     component.
>   * LOOKUP_IN_ROOT is an extension of LOOKUP_BENEATH that, rather than
>     blocking attempts to move past the root, forces all such movements
>     to be scoped to the starting point. This provides chroot(2)-like
>     protection but without the cost of a chroot(2) for each filesystem
>     operation, as well as being safe against race attacks that chroot(2)
>     is not.
>     If a race is detected (as with LOOKUP_BENEATH) then an error is
>     generated, and similar to LOOKUP_BENEATH it is not permitted to cross
>     magic-links with LOOKUP_IN_ROOT.
>     The primary need for this is from container runtimes, which
>     currently need to do symlink scoping in userspace[7] when opening
>     paths in a potentially malicious container. There is a long list of
>     CVEs that could have been mitigated by having RESOLVE_IN_ROOT
>     (such as CVE-2017-1002101, CVE-2017-1002102, CVE-2018-15664, and
>     CVE-2019-5736, just to name a few).
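A minimal userspace sketch of the scoping described for LOOKUP_IN_ROOT, written against the draft man page further down (where the flag is exposed as RESOLVE_IN_ROOT). The names draft_open_how, open_in_root() and same_file() are hypothetical. The flag values and the fallback syscall number 437 are assumptions and should come from the kernel's uapi headers in real code; the struct mirrors the layout in this posting, which may differ from whatever is finally merged.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Assumption: only used when the libc headers do not know the syscall. */
#ifndef SYS_openat2
#define SYS_openat2 437
#endif

/* Assumed flag values; not stated anywhere in this cover letter. */
#ifndef RESOLVE_NO_MAGICLINKS
#define RESOLVE_NO_MAGICLINKS 0x02
#endif
#ifndef RESOLVE_IN_ROOT
#define RESOLVE_IN_ROOT 0x10
#endif

/* Hypothetical local mirror of the draft struct open_how. */
struct draft_open_how {
	uint64_t flags;       /* O_* flags */
	uint16_t mode;        /* mode for O_CREAT / O_TMPFILE */
	uint16_t padding[3];  /* must be zeroed */
	uint64_t resolve;     /* RESOLVE_* flags */
};

/* Open path read-only, treating rootfd as "/" for the whole resolution. */
static int open_in_root(int rootfd, const char *path)
{
	struct draft_open_how how;

	memset(&how, 0, sizeof(how));  /* also zeroes the padding */
	how.flags = O_RDONLY | O_CLOEXEC;
	how.resolve = RESOLVE_IN_ROOT | RESOLVE_NO_MAGICLINKS;
	return (int)syscall(SYS_openat2, rootfd, path, &how, sizeof(how));
}

/* True if both descriptors refer to the same inode on the same device. */
static bool same_file(int a, int b)
{
	struct stat sa, sb;

	if (fstat(a, &sa) < 0 || fstat(b, &sb) < 0)
		return false;
	return sa.st_dev == sb.st_dev && sa.st_ino == sb.st_ino;
}
```

If the flag behaves as described, open_in_root(rootfd, "/../..") should hand back a descriptor for rootfd's own directory rather than an ancestor, which same_file() can confirm. On a kernel without the syscall the call fails with ENOSYS.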
> In order to make all of the above more usable, I'm working on
> libpathrs[8] which is a C-friendly library for safe path resolution. It
> features a userspace-emulated backend if the kernel doesn't support
> openat2(2). Hopefully we can get userspace to switch to using it, and
> thus get openat2(2) support for free once it's ready.
> Future work would include implementing things like RESOLVE_NO_AUTOMOUNT and
> possibly a RESOLVE_NO_REMOTE (to allow programs to be sure they don't hit DoSes
> through stale NFS handles).
> [1]: https://lwn.net/Articles/588444/
> [2]: https://lore.kernel.org/lkml/CA+55aFyyxJL1LyXZeBsf2ypriraj5ut1XkNDsunRBqgVjZU_6Q@mail.gmail.com
> [3]: https://lore.kernel.org/lkml/20170429220414.GT29622@ZenIV.linux.org.uk
> [4]: https://lore.kernel.org/lkml/1415094884-18349-1-git-send-email-drysdale@google.com
> [5]: https://lore.kernel.org/lkml/1404124096-21445-1-git-send-email-drysdale@google.com
> [6]: https://lwn.net/Articles/723057/
> [7]: https://github.com/cyphar/filepath-securejoin
> [8]: https://github.com/openSUSE/libpathrs
> The current draft of the openat2(2) man-page is included below.
> --8<---------------------------------------------------------------------------
> OPENAT2(2)                          Linux Programmer's Manual                          OPENAT2(2)
> NAME
>        openat2 - open and possibly create a file (extended)
> SYNOPSIS
>        #include <sys/types.h>
>        #include <sys/stat.h>
>        #include <fcntl.h>
>        int openat2(int dirfd, const char *pathname, struct open_how *how, size_t size);
>        Note: There is no glibc wrapper for this system call; see NOTES.
> DESCRIPTION
>        The  openat2()  system  call  opens the file specified by pathname.  If the specified file
>        does not exist, it may optionally (if O_CREAT is specified in  how.flags)  be  created  by
>        openat2().
>        As  with openat(2), if pathname is relative, then it is interpreted relative to the direc-
>        tory referred to by the file descriptor dirfd (or the current  working  directory  of  the
>        calling  process,  if dirfd is the special value AT_FDCWD.)  If pathname is absolute, then
>        dirfd is ignored (unless how.resolve contains RESOLVE_IN_ROOT, in which case  pathname  is
>        resolved relative to dirfd.)
>        The  openat2()  system  call  is  an extension of openat(2) and provides a superset of its
>        functionality.  Rather than taking a single flag argument, an extensible  structure  (how)
>        is  passed  instead  to  allow  for  future extensions.  size must be set to sizeof(struct
>        open_how), to facilitate future extensions (see the "Extensibility" section of  the  NOTES
>        for more detail on how extensions are handled.)
>    The open_how structure
>        The following structure indicates how pathname should be opened, and acts as a superset of
>        the flag and mode arguments to openat(2).
>            struct open_how {
>                __aligned_u64 flags;         /* O_* flags. */
>                __u16         mode;          /* Mode for O_{CREAT,TMPFILE}. */
>                __u16         __padding[3];  /* Must be zeroed. */
>                __aligned_u64 resolve;       /* RESOLVE_* flags. */
>            };
>        Any future extensions to openat2() will be implemented as new fields appended to the above
>        structure (or through reuse of pre-existing padding space), with the zero value of the new
>        fields acting as though the extension were not present.
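Since there is no glibc wrapper, a caller has to fill in the structure and go through syscall(2) directly. The following is a rough sketch of such a wrapper, not a definitive one: draft_open_how and draft_openat2() are hypothetical names, the fallback syscall number 437 is an assumption, and the struct is a local copy of the draft layout above (real programs should use the kernel's uapi definition, which may differ from this draft).

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <stdint.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Assumption: only used when the libc headers do not know the syscall. */
#ifndef SYS_openat2
#define SYS_openat2 437
#endif

/* Hypothetical local mirror of the draft struct open_how. */
struct draft_open_how {
	uint64_t flags;       /* O_* flags */
	uint16_t mode;        /* mode for O_CREAT / O_TMPFILE */
	uint16_t padding[3];  /* must be zeroed */
	uint64_t resolve;     /* RESOLVE_* flags */
};

/* The draft layout is 8 + 2 + 6 + 8 bytes. */
_Static_assert(sizeof(struct draft_open_how) == 24,
	       "unexpected draft_open_how size");

/*
 * Thin wrapper: returns a new file descriptor, or -1 with errno set
 * (ENOSYS if the running kernel has no openat2()).
 */
static int draft_openat2(int dirfd, const char *path, uint64_t flags,
			 uint16_t mode, uint64_t resolve)
{
	struct draft_open_how how;

	memset(&how, 0, sizeof(how));  /* padding must be zero */
	how.flags = flags;
	how.mode = mode;
	how.resolve = resolve;
	/* size tells the kernel which version of the struct we were built with */
	return (int)syscall(SYS_openat2, dirfd, path, &how, sizeof(how));
}
```

For instance, draft_openat2(AT_FDCWD, "/", O_RDONLY | O_DIRECTORY, 0, 0) should behave like the equivalent openat(2) call on a kernel that implements the syscall.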
>        The meaning of each field is as follows:
>               flags
>                      The file creation and status flags to use for this operation.   All  of  the
>                      O_* flags defined for openat(2) are valid openat2() flag values.
>                      Unlike openat(2), it is an error to provide openat2() unknown or conflicting
>                      flags in flags.
>               mode
>                      File mode for the new file, with identical semantics to the mode argument to
>                      openat(2).   However,  unlike openat(2), it is an error to provide openat2()
>                      with a mode which contains bits other than 07777.
>                      It is an error to provide openat2() a non-zero mode if flags does  not  con-
>                      tain O_CREAT or O_TMPFILE.
>               resolve
>                      Change  how  the  components  of pathname will be resolved (see path_resolu-
>                      tion(7) for background information.)  The primary use case for  these  flags
>                      is  to  allow trusted programs to restrict how untrusted paths (or paths in-
>                      side untrusted directories) are resolved.  The full list of resolve flags is
>                      given below.
>                      RESOLVE_NO_XDEV
>                             Disallow  traversal of mount points during path resolution (including
>                             all bind mounts).
>                             Users of this flag are encouraged to make its use  configurable  (un-
>                             less  it is used for a specific security purpose), as bind mounts are
>                             very widely used by end-users.  Setting this flag indiscriminately for
>                             all  uses  of  openat2() may result in spurious errors on previously-
>                             functional systems.
>                      RESOLVE_NO_SYMLINKS
>                             Disallow resolution of symbolic links during path  resolution.   This
>                             option implies RESOLVE_NO_MAGICLINKS.
>                             If the trailing component is a symbolic link, and flags contains both
>                             O_PATH and O_NOFOLLOW, then an O_PATH file descriptor referencing the
>                             symbolic link will be returned.
>                             Users  of  this flag are encouraged to make its use configurable (un-
>                             less it is used for a specific security purpose), as  symbolic  links
>                             are very widely used by end-users.  Setting this flag indiscriminately
>                             for all uses of openat2() may result in  spurious  errors  on  previ-
>                             ously-functional systems.
>                      RESOLVE_NO_MAGICLINKS
>                             Disallow all magic link resolution during path resolution.
>                             If  the  trailing  component is a magic link, and flags contains both
>                             O_PATH and O_NOFOLLOW, then an O_PATH file descriptor referencing the
>                             magic link will be returned.
>                             Magic-links  are  symbolic  link-like  objects  that are most notably
>                             found   in   proc(5)   (examples    include    /proc/[pid]/exe    and
>                             /proc/[pid]/fd/*.)   Due to the potential danger of unknowingly open-
>                             ing these magic links, it may be  preferable  for  users  to  disable
>                             their resolution entirely (see symlink(7) for more details.)
>                      RESOLVE_BENEATH
>                             Do  not permit the path resolution to succeed if any component of the
>                             resolution is not a descendant of the directory indicated  by  dirfd.
>                             This results in absolute symbolic links (and absolute values of path-
>                             name) being rejected.
>                             Currently, this flag also disables magic link  resolution.   However,
>                             this  may change in the future.  The caller should explicitly specify
>                             RESOLVE_NO_MAGICLINKS to ensure that magic links are not resolved.
>                      RESOLVE_IN_ROOT
>                             Treat dirfd as the root directory while resolving pathname (as though
>                             the user called chroot(2) with dirfd as the argument.)  Absolute sym-
>                             bolic links and ".." path components will be  scoped  to  dirfd.   If
>                             pathname is an absolute path, it is also treated relative to dirfd.
>                             However,  unlike  chroot(2) (which changes the filesystem root perma-
>                             nently for a process), RESOLVE_IN_ROOT  allows  a  program  to  effi-
>                             ciently  restrict  path  resolution  for only certain operations.  It
>                             also has several hardening features (such as detecting escape attempts
>                             during ".." resolution) which chroot(2) does not.
>                             Currently,  this  flag also disables magic link resolution.  However,
>                             this may change in the future.  The caller should explicitly  specify
>                             RESOLVE_NO_MAGICLINKS to ensure that magic links are not resolved.
>                      It is an error to provide openat2() unknown flags in resolve.
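As an illustration of the resolve flags, here is a small sketch of opening a path with RESOLVE_BENEATH, with RESOLVE_NO_MAGICLINKS given explicitly as recommended above. draft_open_how and open_beneath() are hypothetical names; the flag values and the fallback syscall number 437 are assumptions that real code should take from the kernel's uapi headers.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <stdint.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Assumption: only used when the libc headers do not know the syscall. */
#ifndef SYS_openat2
#define SYS_openat2 437
#endif

/* Assumed flag values; not stated in the man-page draft. */
#ifndef RESOLVE_NO_MAGICLINKS
#define RESOLVE_NO_MAGICLINKS 0x02
#endif
#ifndef RESOLVE_BENEATH
#define RESOLVE_BENEATH 0x08
#endif

/* Hypothetical local mirror of the draft struct open_how. */
struct draft_open_how {
	uint64_t flags;       /* O_* flags */
	uint16_t mode;        /* mode for O_CREAT / O_TMPFILE */
	uint16_t padding[3];  /* must be zeroed */
	uint64_t resolve;     /* RESOLVE_* flags */
};

/*
 * Open path read-only, refusing any resolution that leaves the tree
 * rooted at dirfd. Returns an fd, or -1 with errno set.
 */
static int open_beneath(int dirfd, const char *path)
{
	struct draft_open_how how;

	memset(&how, 0, sizeof(how));  /* also zeroes the padding */
	how.flags = O_RDONLY | O_CLOEXEC;
	how.resolve = RESOLVE_BENEATH | RESOLVE_NO_MAGICLINKS;
	return (int)syscall(SYS_openat2, dirfd, path, &how, sizeof(how));
}
```

Going by the ERRORS list in this draft, open_beneath(dirfd, "../etc") and open_beneath(dirfd, "/etc") would both be expected to fail with EXDEV, while a path that stays inside dirfd opens normally.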
> RETURN VALUE
>        On success, a new file descriptor is returned.  On error, -1 is returned, and errno is set
>        appropriately.
> ERRORS
>        The set of errors returned by openat2() includes all of the errors returned by  openat(2),
>        as well as the following additional errors:
>        EINVAL An unknown flag or invalid value was specified in how.
>        EINVAL mode is non-zero, but flags does not contain O_CREAT or O_TMPFILE.
>        EINVAL size was smaller than any known version of struct open_how.
>        E2BIG  An  extension  was specified in how, which the current kernel does not support (see
>               the "Extensibility" section of the NOTES for more detail on how extensions are han-
>               dled.)
>        EAGAIN resolve  contains  either  RESOLVE_IN_ROOT or RESOLVE_BENEATH, and the kernel could
>               not ensure that a ".." component didn't escape (due to a race condition  or  poten-
>               tial attack.)  Callers may choose to retry the openat2() call.
>        EXDEV  resolve  contains either RESOLVE_IN_ROOT or RESOLVE_BENEATH, and an escape from the
>               root during path resolution was detected.
>        EXDEV  resolve contains RESOLVE_NO_XDEV, and a path component attempted to cross  a  mount
>               point.
>        ELOOP  resolve contains RESOLVE_NO_SYMLINKS, and one of the path components was a symbolic
>               link (or magic link).
>        ELOOP  resolve contains RESOLVE_NO_MAGICLINKS, and one of the path components was a  magic
>               link.
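Because EAGAIN is documented as retryable, a caller using RESOLVE_IN_ROOT or RESOLVE_BENEATH might wrap the call in a bounded retry loop. This is only a sketch of one possible policy: openat2_retry() and draft_open_how are hypothetical names, the retry bound is arbitrary, and the flag value and fallback syscall number 437 are assumptions.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Assumption: only used when the libc headers do not know the syscall. */
#ifndef SYS_openat2
#define SYS_openat2 437
#endif

/* Assumed flag value; not stated in the man-page draft. */
#ifndef RESOLVE_BENEATH
#define RESOLVE_BENEATH 0x08
#endif

/* Hypothetical local mirror of the draft struct open_how. */
struct draft_open_how {
	uint64_t flags;       /* O_* flags */
	uint16_t mode;        /* mode for O_CREAT / O_TMPFILE */
	uint16_t padding[3];  /* must be zeroed */
	uint64_t resolve;     /* RESOLVE_* flags */
};

/*
 * Call openat2() up to max_tries times, retrying only on EAGAIN.
 * Returns an fd, or -1 with errno from the last attempt
 * (EINVAL if max_tries is not positive).
 */
static int openat2_retry(int dirfd, const char *path,
			 const struct draft_open_how *how, int max_tries)
{
	int fd = -1;

	errno = EINVAL;  /* reported when the loop body never runs */
	for (int i = 0; i < max_tries; i++) {
		fd = (int)syscall(SYS_openat2, dirfd, path, how, sizeof(*how));
		if (fd >= 0 || errno != EAGAIN)
			break;  /* success, or an error that retrying won't fix */
	}
	return fd;
}
```

A caller would initialise the structure with a designated initialiser (which zeroes the padding), e.g. `struct draft_open_how how = { .flags = O_RDONLY, .resolve = RESOLVE_BENEATH };`, and treat a final EAGAIN as a hard failure rather than falling back to an unscoped open.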
> VERSIONS
>        openat2() was added to Linux in kernel 5.FOO.
> CONFORMING TO
>        This system call is Linux-specific.
>        The semantics of RESOLVE_BENEATH were modelled after FreeBSD's O_BENEATH.
> NOTES
>        Glibc does not provide a wrapper for this system call; call it using syscall(2).
>    Extensibility
>        In order to allow for struct open_how to be extended in future kernel revisions, openat2()
>        requires userspace to specify the size of the struct open_how structure it is passing.  By
>        providing  this  information,  it  is possible for openat2() to provide both forwards- and
>        backwards-compatibility -- with size acting as an implicit version number (because new ex-
>        tension  fields will always be appended, the size will always increase.)  This extensibil-
>        ity  design  is  very  similar  to   other   system   calls   such   as  sched_setattr(2),
>        perf_event_open(2), and clone3(2).
>        If  we let usize be the size of the structure according to userspace and ksize be the size
>        of the structure which the kernel supports, then there are only three cases to consider:
>               *  If ksize equals usize, then there is no version mismatch and  how  can  be  used
>                  verbatim.
>               *  If  ksize  is  larger than usize, then there are some extensions the kernel sup-
>                  ports which the userspace program is unaware of.  Because  all  extensions  must
>                  have their zero values be a no-op, the kernel treats all of the extension fields
>                  not set by userspace to have zero values.  This  provides  backwards-compatibil-
>                  ity.
>               *  If  ksize  is  smaller  than  usize,  then  there  are some extensions which the
>                  userspace program is aware of but the kernel does not support.  Because all  ex-
>                  tensions  must  have  their zero values be a no-op, the kernel can safely ignore
>                  the unsupported extension fields if they are all-zero.  If any  unsupported  ex-
>                  tension  fields  are  non-zero,  then  -1 is returned and errno is set to E2BIG.
>                  This provides forwards-compatibility.
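The three cases can be modelled in plain userspace C. The function below is a sketch of the rule as described here, not the kernel's actual code; copy_versioned_struct() is a hypothetical name, and dst/ksize stand in for the kernel's view of the structure.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

/*
 * Model of the size negotiation: copy a usize-byte structure supplied by
 * the caller into a ksize-byte structure understood by the callee.
 * Returns 0 on success, or -E2BIG if the caller set a field that the
 * callee does not know about.
 */
static int copy_versioned_struct(void *dst, size_t ksize,
				 const void *src, size_t usize)
{
	const unsigned char *s = src;
	size_t common = usize < ksize ? usize : ksize;

	/* usize > ksize: unknown trailing bytes are only acceptable if zero. */
	for (size_t i = ksize; i < usize; i++) {
		if (s[i] != 0)
			return -E2BIG;
	}

	/* ksize > usize: fields the caller did not supply read as zero. */
	memset(dst, 0, ksize);
	memcpy(dst, src, common);
	return 0;
}
```

With ksize equal to usize this is a plain copy; an older caller gets zero-filled extension fields, and a newer caller is rejected only if it actually uses an extension.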
>        Therefore, most userspace programs will not need to have any special  handling  of  exten-
>        sions.   However,  if  a userspace program wishes to determine what extensions the running
>        kernel supports, it may conduct a binary search on size (to find the largest value which
>        doesn't produce an error of E2BIG.)
> SEE ALSO
>        openat(2), path_resolution(7), symlink(7)
> Linux                                       2019-11-05                                 OPENAT2(2)
> --8<---------------------------------------------------------------------------
> Aleksa Sarai (9):
>   namei: LOOKUP_NO_SYMLINKS: block symlink resolution
>   namei: LOOKUP_NO_MAGICLINKS: block magic-link resolution
>   namei: LOOKUP_NO_XDEV: block mountpoint crossing
>   namei: LOOKUP_BENEATH: O_BENEATH-like scoped resolution
>   namei: LOOKUP_IN_ROOT: chroot-like scoped resolution
>   namei: LOOKUP_{IN_ROOT,BENEATH}: permit limited ".." resolution
>   open: introduce openat2(2) syscall
>   selftests: add openat2(2) selftests
>   Documentation: path-lookup: mention LOOKUP_MAGICLINK_JUMPED
>  CREDITS                                       |   4 +-
>  Documentation/filesystems/path-lookup.rst     |  18 +-
>  arch/alpha/kernel/syscalls/syscall.tbl        |   1 +
>  arch/arm/tools/syscall.tbl                    |   1 +
>  arch/arm64/include/asm/unistd.h               |   2 +-
>  arch/arm64/include/asm/unistd32.h             |   2 +
>  arch/ia64/kernel/syscalls/syscall.tbl         |   1 +
>  arch/m68k/kernel/syscalls/syscall.tbl         |   1 +
>  arch/microblaze/kernel/syscalls/syscall.tbl   |   1 +
>  arch/mips/kernel/syscalls/syscall_n32.tbl     |   1 +
>  arch/mips/kernel/syscalls/syscall_n64.tbl     |   1 +
>  arch/mips/kernel/syscalls/syscall_o32.tbl     |   1 +
>  arch/parisc/kernel/syscalls/syscall.tbl       |   1 +
>  arch/powerpc/kernel/syscalls/syscall.tbl      |   1 +
>  arch/s390/kernel/syscalls/syscall.tbl         |   1 +
>  arch/sh/kernel/syscalls/syscall.tbl           |   1 +
>  arch/sparc/kernel/syscalls/syscall.tbl        |   1 +
>  arch/x86/entry/syscalls/syscall_32.tbl        |   1 +
>  arch/x86/entry/syscalls/syscall_64.tbl        |   1 +
>  arch/xtensa/kernel/syscalls/syscall.tbl       |   1 +
>  fs/namei.c                                    | 176 +++++-
>  fs/open.c                                     | 149 +++--
>  include/linux/fcntl.h                         |  12 +-
>  include/linux/namei.h                         |  11 +
>  include/linux/syscalls.h                      |   3 +
>  include/uapi/asm-generic/unistd.h             |   5 +-
>  include/uapi/linux/fcntl.h                    |  41 ++
>  tools/testing/selftests/Makefile              |   1 +
>  tools/testing/selftests/openat2/.gitignore    |   1 +
>  tools/testing/selftests/openat2/Makefile      |   8 +
>  tools/testing/selftests/openat2/helpers.c     | 109 ++++
>  tools/testing/selftests/openat2/helpers.h     | 107 ++++
>  .../testing/selftests/openat2/openat2_test.c  | 316 +++++++++++
>  .../selftests/openat2/rename_attack_test.c    | 160 ++++++
>  .../testing/selftests/openat2/resolve_test.c  | 523 ++++++++++++++++++
>  35 files changed, 1591 insertions(+), 73 deletions(-)
>  create mode 100644 tools/testing/selftests/openat2/.gitignore
>  create mode 100644 tools/testing/selftests/openat2/Makefile
>  create mode 100644 tools/testing/selftests/openat2/helpers.c
>  create mode 100644 tools/testing/selftests/openat2/helpers.h
>  create mode 100644 tools/testing/selftests/openat2/openat2_test.c
>  create mode 100644 tools/testing/selftests/openat2/rename_attack_test.c
>  create mode 100644 tools/testing/selftests/openat2/resolve_test.c
> base-commit: a99d8080aaf358d5d23581244e5da23b35e340b9

