seccomp feature development

Kees Cook keescook at chromium.org
Wed Jun 3 19:09:44 UTC 2020


[trying to get back to this thread -- I've been distracted]

On Tue, May 19, 2020 at 12:39:39AM +0200, Jann Horn wrote:
> On Mon, May 18, 2020 at 11:05 PM Kees Cook <keescook at chromium.org> wrote:
> > ## deep argument inspection
> >
> > Background: seccomp users would like to write filters that traverse
> > the user pointers passed into many syscalls, but seccomp can't do this
> > dereference for a variety of reasons (mostly involving race conditions and
> > rearchitecting the entire kernel syscall and copy_from_user() code flows).
> 
> Also, other than for syscall entry, it might be worth thinking about
> whether we want to have a special hook into seccomp for io_uring.
> io_uring is growing support for more and more syscalls, including
> things like openat2, connect, sendmsg, splice and so on, and that list
> is probably just going to grow in the future. If people start wanting
> to use io_uring in software with seccomp filters, it might be
> necessary to come up with some mechanism to prevent io_uring from
> permitting access to almost everything else...

/me perks up. Oh my, I hadn't been paying attention -- I thought this
was strictly for I/O ... like it's named. I will go read up.

> [...]
> > The argument caching bit is, I think, rather mechanical in nature since
> > it's all "just" internal to the kernel: seccomp can likely adjust how it
> > allocates seccomp_data (maybe going so far as to have it split across two
> > pages with the syscall argument struct always starting on the 2nd page
> > boundary), and copying the EA struct into that page, which will be both
> > used by the filter and by the syscall.
> 
> We could also do the same kind of thing the eBPF verifier does in
> convert_ctx_accesses(), and rewrite the context accesses to actually
> go through two different pointers depending on the (constant) offset
> into seccomp_data.

Ah, as in "for seccomp_data accesses, add offset $foo and for EA struct
add offset $bar"? Yeah, though my preference is to avoid rewriting the
filters as much as possible. But yes, that's a good point about not
requiring them be strictly contiguous.

> > I imagine state tracking ("is
> > there a cached EA?", "what is the address of seccomp_data?", "what is
> > the address of the EA?") can be associated with the thread struct.
> 
> You probably mean the task struct?

Yup; think-o.

> > ## syscall bitmasks
> 
> YES PLEASE

I've got a working PoC for this now. It's sneaky.

> Other options:
>  - add a "load from read-only memory" opcode and permit specifying the
> data that should be in that memory when loading the filter

I think you've mentioned something like this before to me, but can you
remind me the details? If you mean RO userspace memory, don't we still
run the risk of racing mprotect, etc?

>  - make the seccomp API take an array of (syscall-number,
> instruction-offset) tuples, and start evaluation of the filter at an
> offset if it's one of those syscalls

To avoid making cBPF changes, yeah, perhaps have a way to add per-syscall
filters.

> One more thing that would be really nice: Is there a way we can have
> 64-bit registers in our seccomp filters? At the moment, every
> comparison has to turn into three ALU ops, which is pretty messy;
> libseccomp got that wrong (<https://crbug.com/project-zero/1769>), and
> it contributes to the horrific code Chrome's BPF generator creates.
> Here's some pseudocode from my hacky BPF disassembler, which shows
> pretty much just the filter code for filtering getpriority() and
> setpriority() in a Chrome renderer, with tons of useless dead code:
> 
> 0139         if args[0].high == 0x00000000: [true +3, false +0]
> 013e           if args[0].low != 0x00000000: [true +157, false +0] -> ret TRAP
> 0140           if args[1].high == 0x00000000: [true +3, false +0]
> 0145             if args[1].low == 0x00000000: [true +149, false +0]
> -> ret ALLOW (syscalls: getpriority, setpriority)
> 0147             if args[1].high == 0x00000000: [true +3, false +0]
> 014c               if args[1].low == 0x00000001: [true +142, false
> +141] -> ret ALLOW (syscalls: getpriority, setpriority)
> 01da               ret ERRNO
> 0148             if args[1].high != 0xffffffff: [true +142, false +0]
> -> ret TRAP
> 014a             if args[1].low NO-COMMON-BITS 0x80000000: [true +140,
> false +0] -> ret TRAP
> 014c             if args[1].low == 0x00000001: [true +142, false +141]
> -> ret ALLOW (syscalls: getpriority, setpriority)
> 01da             ret ERRNO
> 0141           if args[1].high != 0xffffffff: [true +149, false +0] -> ret TRAP
> 0143           if args[1].low NO-COMMON-BITS 0x80000000: [true +147,
> false +0] -> ret TRAP
> 0145           if args[1].low == 0x00000000: [true +149, false +0] ->
> ret ALLOW (syscalls: getpriority, setpriority)
> 0147           if args[1].high == 0x00000000: [true +3, false +0]
> 014c             if args[1].low == 0x00000001: [true +142, false +141]
> -> ret ALLOW (syscalls: getpriority, setpriority)
> 01da             ret ERRNO
> 0148           if args[1].high != 0xffffffff: [true +142, false +0] -> ret TRAP
> 014a           if args[1].low NO-COMMON-BITS 0x80000000: [true +140,
> false +0] -> ret TRAP
> 014c           if args[1].low == 0x00000001: [true +142, false +141]
> -> ret ALLOW (syscalls: getpriority, setpriority)
> 01da           ret ERRNO
> 013a         if args[0].high != 0xffffffff: [true +156, false +0] -> ret TRAP
> 013c         if args[0].low NO-COMMON-BITS 0x80000000: [true +154,
> false +0] -> ret TRAP
> 013e         if args[0].low != 0x00000000: [true +157, false +0] -> ret TRAP
> 0140         if args[1].high == 0x00000000: [true +3, false +0]
> 0145           if args[1].low == 0x00000000: [true +149, false +0] ->
> ret ALLOW (syscalls: getpriority, setpriority)
> 0147           if args[1].high == 0x00000000: [true +3, false +0]
> 014c             if args[1].low == 0x00000001: [true +142, false +141]
> -> ret ALLOW (syscalls: getpriority, setpriority)
> 01da             ret ERRNO
> 0148           if args[1].high != 0xffffffff: [true +142, false +0] -> ret TRAP
> 014a           if args[1].low NO-COMMON-BITS 0x80000000: [true +140,
> false +0] -> ret TRAP
> 014c           if args[1].low == 0x00000001: [true +142, false +141]
> -> ret ALLOW (syscalls: getpriority, setpriority)
> 01da           ret ERRNO
> 0141         if args[1].high != 0xffffffff: [true +149, false +0] -> ret TRAP
> 0143         if args[1].low NO-COMMON-BITS 0x80000000: [true +147,
> false +0] -> ret TRAP
> 0145         if args[1].low == 0x00000000: [true +149, false +0] ->
> ret ALLOW (syscalls: getpriority, setpriority)
> 0147         if args[1].high == 0x00000000: [true +3, false +0]
> 014c           if args[1].low == 0x00000001: [true +142, false +141]
> -> ret ALLOW (syscalls: getpriority, setpriority)
> 01da           ret ERRNO
> 0148         if args[1].high != 0xffffffff: [true +142, false +0] -> ret TRAP
> 014a         if args[1].low NO-COMMON-BITS 0x80000000: [true +140,
> false +0] -> ret TRAP
> 014c         if args[1].low == 0x00000001: [true +142, false +141] ->
> ret ALLOW (syscalls: getpriority, setpriority)
> 01da         ret ERRNO
> 
> which is generated by this little snippet of C++ code:
> 
> ResultExpr RestrictGetSetpriority(pid_t target_pid) {
>   const Arg<int> which(0);
>   const Arg<int> who(1);
>   return If(which == PRIO_PROCESS,
>             Switch(who).CASES((0, target_pid), Allow()).Default(Error(EPERM)))
>       .Else(CrashSIGSYS());
> }

What would this look like in eBPF?

> On anything other than 32-bit MIPS, 32-bit powerpc and 32-bit sparc,
> we're actually already using the eBPF backend infrastructure... so
> maybe it would be an option to keep enforcing basically the same rules
> that we currently have for cBPF, but use the eBPF instruction format?

Yeah, I think this might be a good idea just for the reduction in
complexity for these things. The "unpriv BPF" problem needs to be solved
for this still, though, yes?

-- 
Kees Cook


More information about the Containers mailing list