seccomp: Delay filter activation

Christian Brauner christian.brauner at ubuntu.com
Mon Mar 1 13:21:56 UTC 2021


On Mon, Mar 01, 2021 at 12:09:09PM +0100, Christian Brauner wrote:
> On Sat, Feb 20, 2021 at 01:31:57AM -0800, Sargun Dhillon wrote:
> > We've run into a problem where attaching a filter can be quite messy
> > business because the filter itself intercepts sendmsg, and other
> > syscalls related to exfiltrating the listener FD. I believe that this
> > problem set has been brought up before, and although there are
> > "simpler" methods of exfiltrating the listener, like clone3 or
> > pidfd_getfd, but these are still less than ideal.
> 
> (You really like sending patches and discussion points in the middle of
> the merge window. :D I think everyone's panicked about getting their PRs
> in shape so it's not unlikely that this sometimes gets lost on the list. :))
> 
> It hasn't been a huge problem for us, especially since we added
> pidfd_getfd() this seemed like a straightforward problem to solve by
> selecting a fix fd number that is to be used for the listener. But I can
> see why it is annoying.
> 
> > 
> > One of the ideas that's been talked about (I want to say back at LSS
> > NA) is the idea of "delayed activation". I was thinking that it might
> > be nice to have a mechanism to do delayed attach, either activated on
> > execve / fork, or an ioctl on the listenerfd to activate the filter
> > and have a flag like SECCOMP_FILTER_FLAG_NEW_LISTENER_INACTIVE, which
> > indicates that the listener should be setup, but not enforcing, and
> > another ioctl to activate it.
> > 
> > The later approach is preferred due to simplicity, but I can see a
> > situation where you could accidentally get into a state where the
> > filter is not being enforced. Additionally, this may have unforeseen
> > implications with CRIU.
> 
> (If you were to expose an ioctl() that allows userspace to query the
> notifer state then CRIU shouldn't have a problem restoring the notifier
> in the correct state. Right now it doesn't do anyting fancy about the
> notifier, it just restores the task with the filter. It just has to
> learn about the new feature and that's fine imho.)
> 
> > 
> > I'm curious whether this is a problem others share, and whether any of
> > the aforementioned approaches seem reasonable.
> 
> So when I originally suggested the delayed activation I I had another
> related idea that I think I might have mentioned too: if we're already
> considering delaying filter activation I like to discuss the possibility
> of attaching a seccomp filter to a task.
> 
> Right now, if one task wants to attach to another task they need to
> recreate the whole seccomp filter and load it. That's not just pretty
> expensive but also only works if you have access to the rules that the
> filter was generated with. For container that's usually some sort of
> pseudo seccomp filter configuration language dumped into a config file
> from which it can be read.
> 
> So right now the status quo is:
> 
> struct sock_filter filter[] = {
>         BPF_STMT(BPF_LD|BPF_W|BPF_ABS, offsetof(struct seccomp_data, nr)),
>         BPF_JUMP(BPF_JMP|BPF_JEQ|BPF_K, nr, 0, 1),
>         BPF_STMT(BPF_RET|BPF_K, SECCOMP_RET_USER_NOTIF), /* Get me a listener fd */
>         BPF_STMT(BPF_RET|BPF_K, SECCOMP_RET_ALLOW),
> };
> struct sock_fprog prog = {
>         .len = (unsigned short)ARRAY_SIZE(filter),
>         .filter = filter,
> };
> int fd = seccomp(SECCOMP_SET_MODE_FILTER, flags, &prog);
> 
> and then the caller must send the fd to the manager or the manager uses
> pidfd_getfd().
> 
> But, why not get a bit crazy^wcreative; especially since seccomp() is
> already a multiplexer. We introduce a new seccomp flag:
> 
> #define SECCOMP_FILTER_DETACHED
> 
> and a new seccomp command:
> 
> #define SECCOMP_ATTACH_FILTER
> 
> And now we could do something like:
> 
> pid_t pid = fork();
> if (pid < 0)
> 	return;
> 
> if (pid == 0) {
> 	// do stuff
> 	BARRIER_WAKE_SETUP_DONE;
> 
> 	// do more unrelated stuff
> 
> 	BARRIER_WAIT_SECCOMP_FILTER;
> 	execve(exec-something);
> } else {
> 	
> 	int fd_filter;
> 
> 	struct sock_filter filter[] = {
> 	        BPF_STMT(BPF_LD|BPF_W|BPF_ABS, offsetof(struct seccomp_data, nr)),
> 	        BPF_JUMP(BPF_JMP|BPF_JEQ|BPF_K, nr, 0, 1),
> 	        BPF_STMT(BPF_RET|BPF_K, SECCOMP_RET_ALLOW),
> 	};
> 	
> 	struct sock_fprog prog = {
> 	        .len = (unsigned short)ARRAY_SIZE(filter),
> 	        .filter = filter,
> 	};
> 	
> 	int fd_filter = seccomp(SECCOMP_SET_MODE_FILTER, SECCOMP_FILTER_DETACHED, &prog);
> 
> 	BARRIER_WAIT_SETUP_DONE;
> 
> 	int ret = seccomp(SECCOMP_ATTACH_FILTER, 0, INT_TO_PTR(fd_listener));

This obviously should've been sm like:

struct seccomp_filter_attach {
	union {
		__s32 pidfd;
		__s32 pid;
	};
	__u32 fd_filter;
};

and then

int ret = seccomp(SECCOMP_ATTACH_FILTER, 0, seccomp_filter_attach);

> 
> 	BARRIER_WAKE_SECCOMP_FILTER;
> }
> 
> And now you have attached a filter to another task. This would be super
> elegant for a container manager. The container manager could also stash
> the filter fd and when attaching to a container the manager can send the
> attaching task the fd and the attaching task can do:
> 
> int ret = seccomp(SECCOMP_ATTACH_FILTER, 0, INT_TO_PTR(fd_filter));
> 
> too and would be attached to the same filter as the target task.
> 
> And for the listener fd case a container manager could simply set
> SECCOMP_RET_USER_NOTIF as before
> 
> 	struct sock_filter filter[] = {
> 	        BPF_STMT(BPF_LD|BPF_W|BPF_ABS, offsetof(struct seccomp_data, nr)),
> 	        BPF_JUMP(BPF_JMP|BPF_JEQ|BPF_K, nr, 0, 1),
> 		BPF_STMT(BPF_RET|BPF_K, SECCOMP_RET_USER_NOTIF),
> 	        BPF_STMT(BPF_RET|BPF_K, SECCOMP_RET_ALLOW),
> 	};
> 
> and now fd_filter simply functions as the notifier fd after
> seccomp(SECCOMP_ATTACH_FILTER) that's basically the fancy version of my
> delayed notifier activiation idea.
> 
> I'm sure there's nastiness to figure out but I would love to see
> something like this.
> 
> Christian


More information about the Containers mailing list