[PATCH v3 1/4] seccomp: add a return code to trap to userspace

Matthew Helsley matt.helsley at gmail.com
Tue Jun 12 21:39:03 UTC 2018


On Thu, May 31, 2018 at 7:49 AM, Tycho Andersen <tycho at tycho.ws> wrote:

<snip>


> +struct seccomp_notif {
> +       __u64 id;
> +       pid_t pid;
> +       struct seccomp_data data;
> +};
>

Since it's part of the UAPI I think it would be good to add documentation
to this patch series. Part of that documentation should talk about which
pid namespaces this pid value is relevant in. This is especially important
since the feature is designed for use by things like container/sandbox
managers.


> +
> +struct seccomp_notif_resp {
> +       __u64 id;
> +       __s32 error;
> +       __s64 val;
> +};
> +
>  #endif /* _UAPI_LINUX_SECCOMP_H */
>

<snip>


> +struct seccomp_knotif {
> +       /* The pid whose filter triggered the notification */
> +       pid_t pid;
> +
> +       /*
> +        * The "cookie" for this request; this is unique for this filter.
> +        */
> +       u32 id;
> +
> +       /*
> +        * The seccomp data. This pointer is valid the entire time this
> +        * notification is active, since it comes from __seccomp_filter which
> +        * eclipses the entire lifecycle here.
> +        */
> +       const struct seccomp_data *data;
> +
> +       /*
> +        * Notification states. When SECCOMP_RET_USER_NOTIF is returned, a
> +        * struct seccomp_knotif is created and starts out in INIT. Once the
> +        * handler reads the notification off of an FD, it transitions to READ.
> +        * If a signal is received the state transitions back to INIT and
> +        * another message is sent. When the userspace handler replies, state
> +        * transitions to REPLIED.
> +        */
> +       enum notify_state state;
> +
> +       /* The return values, only valid when in SECCOMP_NOTIFY_REPLIED */
> +       int error;
> +       long val;
> +
> +       /* Signals when this has entered SECCOMP_NOTIFY_REPLIED */
> +       struct completion ready;
> +
> +       struct list_head list;
> +};
> +#endif
>

<snip>


> +static void seccomp_do_user_notification(int this_syscall,
> +                                        struct seccomp_filter *match,
> +                                        const struct seccomp_data *sd)
> +{
> +       int err;
> +       long ret = 0;
> +       struct seccomp_knotif n = {};
> +
> +       mutex_lock(&match->notify_lock);
> +       err = -ENOSYS;
> +       if (!match->has_listener)
> +               goto out;
> +
> +       n.pid = current->pid;
>

How have you tested this code for correctness? I don't see many
combinations being tested, either below or here:
https://github.com/tych0/kernel-utils/blob/master/seccomp/notify_stress.c

What about processes in different pid namespaces? Please add tests that
vary key parameters like this between the task generating the notifications
and the task processing them, tests that kill either task at interesting
times, and tests that pass the notification fd around to see how that works
(or doesn't).
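
For the fd-passing case, something along these lines might serve as a
starting point. It is only a sketch: send_fd(), recv_fd() and
fd_passing_roundtrip() are names made up for illustration, and a pipe
stands in for the seccomp listener fd so the helpers can be exercised
without the patch applied. The transport is plain SCM_RIGHTS over an
AF_UNIX socket.

```c
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Hypothetical helper: send one fd over a connected AF_UNIX socket. */
int send_fd(int sock, int fd)
{
	char dummy = 'x';	/* SCM_RIGHTS needs at least one data byte */
	struct iovec iov = { .iov_base = &dummy, .iov_len = 1 };
	union {
		char buf[CMSG_SPACE(sizeof(int))];
		struct cmsghdr align;
	} u;
	struct msghdr msg = { 0 };
	struct cmsghdr *cmsg;

	assert(sock >= 0 && fd >= 0);
	memset(&u, 0, sizeof(u));
	msg.msg_iov = &iov;
	msg.msg_iovlen = 1;
	msg.msg_control = u.buf;
	msg.msg_controllen = sizeof(u.buf);

	cmsg = CMSG_FIRSTHDR(&msg);
	cmsg->cmsg_level = SOL_SOCKET;
	cmsg->cmsg_type = SCM_RIGHTS;
	cmsg->cmsg_len = CMSG_LEN(sizeof(int));
	memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

	return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

/* Hypothetical helper: receive one fd; returns the new fd or -1. */
int recv_fd(int sock)
{
	char dummy;
	struct iovec iov = { .iov_base = &dummy, .iov_len = 1 };
	union {
		char buf[CMSG_SPACE(sizeof(int))];
		struct cmsghdr align;
	} u;
	struct msghdr msg = { 0 };
	struct cmsghdr *cmsg;
	int fd;

	msg.msg_iov = &iov;
	msg.msg_iovlen = 1;
	msg.msg_control = u.buf;
	msg.msg_controllen = sizeof(u.buf);

	if (recvmsg(sock, &msg, 0) != 1)
		return -1;
	cmsg = CMSG_FIRSTHDR(&msg);
	if (!cmsg || cmsg->cmsg_level != SOL_SOCKET ||
	    cmsg->cmsg_type != SCM_RIGHTS ||
	    cmsg->cmsg_len != CMSG_LEN(sizeof(int)))
		return -1;
	memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));
	return fd;
}

/* Pass a pipe's read end across a socketpair and read through the copy. */
int fd_passing_roundtrip(void)
{
	int sv[2], p[2], got, ret = -1;
	char c = 0;

	if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0)
		return -1;
	if (pipe(p) < 0)
		goto close_sock;
	if (send_fd(sv[0], p[0]) < 0)
		goto close_pipe;
	got = recv_fd(sv[1]);
	if (got < 0)
		goto close_pipe;
	if (write(p[1], "k", 1) == 1 && read(got, &c, 1) == 1 && c == 'k')
		ret = 0;
	close(got);
close_pipe:
	close(p[0]);
	close(p[1]);
close_sock:
	close(sv[0]);
	close(sv[1]);
	return ret;
}
```

In an actual selftest the sender would hand over the listener fd instead
of a pipe, and the receiver would ideally be a child living in a different
pid namespace that then does the read()/write() on it.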

I ask about testing because you're effectively returning a pid value to
userspace here but not using the proper macros to access the task's struct
pid for that purpose. That will work so long as both processes are in the
same pid namespace but breaks otherwise.

Furthermore, a raw pid value is not the best thing to queue with these
notifications. A single pid value is not meaningful outside current's pid
namespace, and the seccomp notification file descriptor could be passed on
to processes in other pid namespaces, where the value will almost never be
relevant or correct. Taking a reference to the struct pid instead avoids
that: during read(), the code would use the proper macro to turn the struct
pid reference into a pid value relevant in the *reader's* pid namespace,
*or* return something obviously bogus if the reader is in a pid namespace
that can't see that pid. This could prevent management processes from being
tricked into clobbering the wrong process, feeding the wrong process
sensitive information via syscall results, etc.
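
To make the translate-at-read() idea concrete, here is a small userspace
model. It is not kernel code: every model_* name is hypothetical, and the
lookup is only patterned on how I understand the kernel's pid_nr_ns() to
behave (one pid number per namespace level, 0 when the pid is not visible
from the namespace being asked about). In the kernel the equivalent would
presumably be holding a struct pid reference in the notification and
translating it in the read() path.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_MAX_LEVEL 4	/* arbitrary cap for this sketch */

/* Stand-in for a pid namespace; only its nesting depth matters here. */
struct model_ns {
	int level;
};

/* One (number, namespace) pair, one per level the task is visible in. */
struct model_upid {
	int nr;
	const struct model_ns *ns;
};

/* Stand-in for struct pid: the task's numbers from level 0 downwards. */
struct model_pid {
	int level;
	struct model_upid numbers[MODEL_MAX_LEVEL];
};

/*
 * Return the pid number as seen from @ns, or 0 if the pid is not visible
 * there (a different branch of the namespace tree, or a deeper level).
 */
int model_pid_nr_ns(const struct model_pid *pid, const struct model_ns *ns)
{
	if (pid && ns && ns->level <= pid->level) {
		const struct model_upid *upid = &pid->numbers[ns->level];

		if (upid->ns == ns)
			return upid->nr;
	}
	return 0;
}

enum model_reader { READER_INIT_NS, READER_SAME_NS, READER_SIBLING_NS };

/* A task in container namespace A, looked up by three different readers. */
int model_demo_lookup(enum model_reader reader)
{
	static const struct model_ns init_ns = { .level = 0 };
	static const struct model_ns ns_a = { .level = 1 };
	static const struct model_ns ns_b = { .level = 1 };
	static const struct model_pid task = {
		.level = 1,
		.numbers = { { 4242, &init_ns }, { 7, &ns_a } },
	};

	assert(task.level < MODEL_MAX_LEVEL);

	switch (reader) {
	case READER_INIT_NS:
		return model_pid_nr_ns(&task, &init_ns);
	case READER_SAME_NS:
		return model_pid_nr_ns(&task, &ns_a);
	case READER_SIBLING_NS:
		return model_pid_nr_ns(&task, &ns_b);
	}
	return 0;
}
```

With this model a reader in the initial namespace sees 4242, a reader in
the task's own namespace sees 7, and a reader in a sibling namespace gets
0, which is the "obviously bogus" case. Copying current->pid at
notification time would hand the same number to all three readers.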

Alternately, you could choose to specify that the seccomp manager is
expected to be in the pid namespace of the process it's managing at all
times. That's not necessarily trivial either because the process(es) it
manages could potentially create new child pid namespaces. It also means
that the processes being managed can "see" the manager process at all times.

Regardless, you still need to use the proper macros to access current's pid
for export to userspace.

> +       n.state = SECCOMP_NOTIFY_INIT;
> +       n.data = sd;
> +       n.id = seccomp_next_notify_id(match);
> +       init_completion(&n.ready);
> +
> +       list_add(&n.list, &match->notifications);
> +
> +       mutex_unlock(&match->notify_lock);
> +       up(&match->request);
> +
> +       err = wait_for_completion_interruptible(&n.ready);
> +       mutex_lock(&match->notify_lock);
> +
> +       /*
> +        * Here it's possible we got a signal and then had to wait on the mutex
> +        * while the reply was sent, so let's be sure there wasn't a response
> +        * in the meantime.
> +        */
> +       if (err < 0 && n.state != SECCOMP_NOTIFY_REPLIED) {
> +               /*
> +                * We got a signal. Let's tell userspace about it (potentially
> +                * again, if we had already notified them about the first one).
> +                */
> +               if (n.state == SECCOMP_NOTIFY_SENT) {
> +                       n.state = SECCOMP_NOTIFY_INIT;
> +                       up(&match->request);
> +               }
> +               mutex_unlock(&match->notify_lock);
> +               err = wait_for_completion_killable(&n.ready);
> +               mutex_lock(&match->notify_lock);
> +               if (err < 0)
> +                       goto remove_list;
> +       }
> +
> +       ret = n.val;
> +       err = n.error;
> +
> +remove_list:
> +       list_del(&n.list);
> +out:
> +       mutex_unlock(&match->notify_lock);
> +       syscall_set_return_value(current, task_pt_regs(current),
> +                                err, ret);
> +}
>

<snip>


> +static ssize_t seccomp_notify_read(struct file *f, char __user *buf,
> +                                  size_t size, loff_t *ppos)
> +{
> +       struct seccomp_filter *filter = f->private_data;
> +       struct seccomp_knotif *knotif = NULL, *cur;
> +       struct seccomp_notif unotif;
> +       ssize_t ret;
> +
> +       /* No offset reads. */
> +       if (*ppos != 0)
> +               return -EINVAL;
> +
> +       ret = down_interruptible(&filter->request);
> +       if (ret < 0)
> +               return ret;
> +
> +       mutex_lock(&filter->notify_lock);
> +       list_for_each_entry(cur, &filter->notifications, list) {
> +               if (cur->state == SECCOMP_NOTIFY_INIT) {
> +                       knotif = cur;
> +                       break;
> +               }
> +       }
> +
> +       /*
> +        * If we didn't find a notification, it could be that the task was
> +        * interrupted between the time we were woken and when we were able to
> +        * acquire the rw lock. Should we retry here or just -ENOENT? -ENOENT
> +        * for now.
> +        */
> +       if (!knotif) {
> +               ret = -ENOENT;
> +               goto out;
> +       }
> +
> +       unotif.id = knotif->id;
> +       unotif.pid = knotif->pid;
> +       unotif.data = *(knotif->data);
> +
> +       size = min_t(size_t, size, sizeof(struct seccomp_notif));
> +       if (copy_to_user(buf, &unotif, size)) {
> +               ret = -EFAULT;
> +               goto out;
> +       }
> +
> +       ret = sizeof(unotif);
> +       knotif->state = SECCOMP_NOTIFY_SENT;
> +
> +out:
> +       mutex_unlock(&filter->notify_lock);
> +       return ret;
> +}
>

<snip>


> diff --git a/tools/testing/selftests/seccomp/seccomp_bpf.c b/tools/testing/selftests/seccomp/seccomp_bpf.c
> index 168c66d74fc5..f439bd3fb15f 100644
> --- a/tools/testing/selftests/seccomp/seccomp_bpf.c
> +++ b/tools/testing/selftests/seccomp/seccomp_bpf.c
> @@ -40,10 +40,12 @@
>  #include <sys/fcntl.h>
>  #include <sys/mman.h>
>  #include <sys/times.h>
> +#include <sys/socket.h>
>
>  #define _GNU_SOURCE
>  #include <unistd.h>
>  #include <sys/syscall.h>
> +#include <poll.h>
>
>  #include "../kselftest_harness.h"
>
> @@ -150,6 +152,24 @@ struct seccomp_metadata {
>  };
>  #endif
>
> +#ifndef SECCOMP_FILTER_FLAG_GET_LISTENER
> +#define SECCOMP_FILTER_FLAG_GET_LISTENER 4
> +
> +#define SECCOMP_RET_USER_NOTIF 0x7fc00000U
> +
> +struct seccomp_notif {
> +       __u64 id;
> +       pid_t pid;
> +       struct seccomp_data data;
> +};
> +
> +struct seccomp_notif_resp {
> +       __u64 id;
> +       __s32 error;
> +       __s64 val;
> +};
> +#endif
> +
>  #ifndef seccomp
>  int seccomp(unsigned int op, unsigned int flags, void *args)
>  {
> @@ -2072,7 +2092,8 @@ TEST(seccomp_syscall_mode_lock)
>  TEST(detect_seccomp_filter_flags)
>  {
>         unsigned int flags[] = { SECCOMP_FILTER_FLAG_TSYNC,
> -                                SECCOMP_FILTER_FLAG_LOG };
> +                                SECCOMP_FILTER_FLAG_LOG,
> +                                SECCOMP_FILTER_FLAG_GET_LISTENER };
>         unsigned int flag, all_flags;
>         int i;
>         long ret;
> @@ -2917,6 +2938,178 @@ TEST(get_metadata)
>         ASSERT_EQ(0, kill(pid, SIGKILL));
>  }
>
> +static int user_trap_syscall(int nr, unsigned int flags)
> +{
> +       struct sock_filter filter[] = {
> +               BPF_STMT(BPF_LD+BPF_W+BPF_ABS,
> +                       offsetof(struct seccomp_data, nr)),
> +               BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, nr, 0, 1),
> +               BPF_STMT(BPF_RET+BPF_K, SECCOMP_RET_USER_NOTIF),
> +               BPF_STMT(BPF_RET+BPF_K, SECCOMP_RET_ALLOW),
> +       };
> +
> +       struct sock_fprog prog = {
> +               .len = (unsigned short)ARRAY_SIZE(filter),
> +               .filter = filter,
> +       };
> +
> +       return seccomp(SECCOMP_SET_MODE_FILTER, flags, &prog);
> +}
> +
> +static int read_notif(int listener, struct seccomp_notif *req)
> +{
> +       int ret;
> +
> +       do {
> +               errno = 0;
> +               ret = read(listener, req, sizeof(*req));
> +       } while (ret == -1 && errno == ENOENT);
> +       return ret;
> +}
> +
> +static void signal_handler(int signal)
> +{
> +}
> +
> +#define USER_NOTIF_MAGIC 116983961184613L
> +TEST(get_user_notification_syscall)
> +{
> +       pid_t pid;
> +       long ret;
> +       int status, listener;
> +       struct seccomp_notif req = {};
> +       struct seccomp_notif_resp resp = {};
> +       struct pollfd pollfd;
> +
> +       struct sock_filter filter[] = {
> +               BPF_STMT(BPF_RET|BPF_K, SECCOMP_RET_ALLOW),
> +       };
> +       struct sock_fprog prog = {
> +               .len = (unsigned short)ARRAY_SIZE(filter),
> +               .filter = filter,
> +       };
> +
> +       pid = fork();
> +       ASSERT_GE(pid, 0);
> +
> +       /* Check that we get -ENOSYS with no listener attached */
> +       if (pid == 0) {
> +               if (user_trap_syscall(__NR_getpid, 0) < 0)
> +                       exit(1);
> +               ret = syscall(__NR_getpid);
> +               exit(ret >= 0 || errno != ENOSYS);
> +       }
> +
> +       EXPECT_EQ(waitpid(pid, &status, 0), pid);
> +       EXPECT_EQ(true, WIFEXITED(status));
> +       EXPECT_EQ(0, WEXITSTATUS(status));
> +
> +       /* Add some no-op filters so that we (don't) trigger lockdep. */
> +       EXPECT_EQ(seccomp(SECCOMP_SET_MODE_FILTER, 0, &prog), 0);
> +       EXPECT_EQ(seccomp(SECCOMP_SET_MODE_FILTER, 0, &prog), 0);
> +       EXPECT_EQ(seccomp(SECCOMP_SET_MODE_FILTER, 0, &prog), 0);
> +       EXPECT_EQ(seccomp(SECCOMP_SET_MODE_FILTER, 0, &prog), 0);
> +
> +       /* Check that the basic notification machinery works */
> +       listener = user_trap_syscall(__NR_getpid,
> +                                    SECCOMP_FILTER_FLAG_GET_LISTENER);
> +       EXPECT_GE(listener, 0);
> +
> +       /* Installing a second listener in the chain should EBUSY */
> +       EXPECT_EQ(user_trap_syscall(__NR_getpid,
> +                                   SECCOMP_FILTER_FLAG_GET_LISTENER),
> +                 -1);
> +       EXPECT_EQ(errno, EBUSY);
> +
> +       pid = fork();
> +       ASSERT_GE(pid, 0);
> +
> +       if (pid == 0) {
> +               ret = syscall(__NR_getpid);
> +               exit(ret != USER_NOTIF_MAGIC);
> +       }
> +
> +       EXPECT_EQ(read(listener, &req, sizeof(req)), sizeof(req));
> +
> +       pollfd.fd = listener;
> +       pollfd.events = POLLIN | POLLOUT;
> +
> +       EXPECT_GT(poll(&pollfd, 1, -1), 0);
> +       EXPECT_EQ(pollfd.revents, POLLOUT);
> +
> +       EXPECT_EQ(req.data.nr,  __NR_getpid);
> +
> +       resp.id = req.id;
> +       resp.error = 0;
> +       resp.val = USER_NOTIF_MAGIC;
> +
> +       EXPECT_EQ(write(listener, &resp, sizeof(resp)), sizeof(resp));
> +
> +       EXPECT_EQ(waitpid(pid, &status, 0), pid);
> +       EXPECT_EQ(true, WIFEXITED(status));
> +       EXPECT_EQ(0, WEXITSTATUS(status));
> +
> +       /*
> +        * Check that nothing bad happens when we kill the task in the middle
> +        * of a syscall.
> +        */
> +       pid = fork();
> +       ASSERT_GE(pid, 0);
> +
> +       if (pid == 0) {
> +               ret = syscall(__NR_getpid);
> +               exit(ret != USER_NOTIF_MAGIC);
> +       }
> +
> +       ret = read(listener, &req, sizeof(req));
> +       EXPECT_EQ(ret, sizeof(req));
> +
> +       EXPECT_EQ(kill(pid, SIGKILL), 0);
> +       EXPECT_EQ(waitpid(pid, NULL, 0), pid);
> +
> +       resp.id = req.id;
> +       ret = write(listener, &resp, sizeof(resp));
> +       EXPECT_EQ(ret, -1);
> +       EXPECT_EQ(errno, EINVAL);
> +
> +       /*
> +        * Check that we get another notification about a signal in the middle
> +        * of a syscall.
> +        */
> +       pid = fork();
> +       ASSERT_GE(pid, 0);
> +
> +       if (pid == 0) {
> +               if (signal(SIGUSR1, signal_handler) == SIG_ERR) {
> +                       perror("signal");
> +                       exit(1);
> +               }
> +               ret = syscall(__NR_getpid);
> +               exit(ret != USER_NOTIF_MAGIC);
> +       }
> +
> +       ret = read_notif(listener, &req);
> +       EXPECT_EQ(ret, sizeof(req));
> +       EXPECT_EQ(errno, 0);
> +
> +       EXPECT_EQ(kill(pid, SIGUSR1), 0);
> +
> +       ret = read_notif(listener, &req);
> +       EXPECT_EQ(ret, sizeof(req));
> +       EXPECT_EQ(errno, 0);
> +
> +       resp.id = req.id;
> +       ret = write(listener, &resp, sizeof(resp));
> +       EXPECT_EQ(ret, sizeof(resp));
> +       EXPECT_EQ(errno, 0);
> +
> +       EXPECT_EQ(waitpid(pid, &status, 0), pid);
> +       EXPECT_EQ(true, WIFEXITED(status));
> +       EXPECT_EQ(0, WEXITSTATUS(status));
> +
> +       close(listener);
> +}
> +
>  /*
>   * TODO:
>   * - add microbenchmarks
>

More combinations of tests would be good too.

Cheers,
     -Matt Helsley

