[PATCH 3/6] seccomp: Implement constant action bitmaps

Kees Cook keescook at chromium.org
Thu Sep 24 07:36:56 UTC 2020


On Thu, Sep 24, 2020 at 02:25:03AM +0200, Jann Horn wrote:
> On Thu, Sep 24, 2020 at 1:29 AM Kees Cook <keescook at chromium.org> wrote:
> > +/* When no bits are set for a syscall, filters are run. */
> > +struct seccomp_bitmaps {
> > +#ifdef SECCOMP_ARCH
> > +       /* "allow" are initialized to set and only ever get cleared. */
> > +       DECLARE_BITMAP(allow, NR_syscalls);
> 
> This bitmap makes sense.
> 
> > +       /* These are initialized to clear and only ever get set. */
> > +       DECLARE_BITMAP(kill_thread, NR_syscalls);
> > +       DECLARE_BITMAP(kill_process, NR_syscalls);
> 
> I don't think these bitmaps make sense, this is not part of any fastpath.

That's a fair point. I think I arrived at this design because it ended
up making filter addition faster ("don't bother processing this one,
it's already 'kill'"), but it's likely not worse the memory usage
trade-off.

> (However, a "which syscalls have a fixed result" bitmap might make
> sense if we want to export the list of permitted syscalls as a text
> file in procfs, as I mentioned over at
> <https://lore.kernel.org/lkml/CAG48ez3Ofqp4crXGksLmZY6=fGrF_tWyUCg7PBkAetvbbOPeOA@mail.gmail.com/>.)

I haven't found a data structure I'm happy with for this. It seemed like
NR_syscalls * sizeof(u32) was rather a lot (i.e. to store the BPF_RET
value). However, let me discuss that more in the "why in in thread?"
below...

> The "NR_syscalls" part assumes that the compat syscall tables will not
> be bigger than the native syscall table, right? I guess that's usually
> mostly true nowadays, thanks to the syscall table unification...
> (might be worth a comment though)

Hrm, I had convinced myself it was a max() of compat. But I see no
evidence of that now. Which means that I can add these to the per-arch
seccomp defines with something like:

# define SECCOMP_NR_NATIVE	NR_syscalls
# define SECCOMP_NR_COMPAT	X32_NR_syscalls
...

> > +#endif
> > +};
> > +
> >  struct seccomp_filter;
> >  /**
> >   * struct seccomp - the state of a seccomp'ed process
> > @@ -45,6 +56,13 @@ struct seccomp {
> >  #endif
> >         atomic_t filter_count;
> >         struct seccomp_filter *filter;
> > +       struct seccomp_bitmaps native;
> > +#ifdef CONFIG_COMPAT
> > +       struct seccomp_bitmaps compat;
> > +#endif
> > +#ifdef SECCOMP_MULTIPLEXED_SYSCALL_TABLE_ARCH
> > +       struct seccomp_bitmaps multiplex;
> > +#endif
> 
> Why do we have one bitmap per thread (in struct seccomp) instead of
> putting the bitmap for a given filter and all its ancestors into the
> seccomp_filter?

I explicitly didn't want to add code that was run per-filter; I wanted
O(1), not O(n) even if the n work was a small constant. There is
obviously a memory/perf tradeoff here. I wonder if the middle ground
would be to put a bitmap and "constant action" results in the filter....
oh duh. The "top" filter is already going to be composed with its
ancestors. That's all that needs to be checked. Then the tri-state can
be:

bitmap accept[NR_syscalls]: accept or check "known" bitmap
bitmap filter[NR_syscalls]: run filter or return known action
u32 known_action[NR_syscalls];

(times syscall numbering "architecture" counts)

Though perhaps it would be just as fast as:

bitmap run_filter[NR_syscalls]: run filter or return known_action
u32 known_action[NR_syscalls];

where accept isn't treated special...

> 
> >  };
> >
> >  #ifdef CONFIG_HAVE_ARCH_SECCOMP_FILTER
> > diff --git a/kernel/seccomp.c b/kernel/seccomp.c
> > index 0a3ff8eb8aea..111a238bc532 100644
> > --- a/kernel/seccomp.c
> > +++ b/kernel/seccomp.c
> > @@ -318,7 +318,7 @@ static inline u8 seccomp_get_arch(u32 syscall_arch, u32 syscall_nr)
> >
> >  #ifdef SECCOMP_MULTIPLEXED_SYSCALL_TABLE_ARCH
> >         if (syscall_arch == SECCOMP_MULTIPLEXED_SYSCALL_TABLE_ARCH) {
> > -               seccomp_arch |= (sd->nr & SECCOMP_MULTIPLEXED_SYSCALL_TABLE_MASK) >>
> > +               seccomp_arch |= (syscall_nr & SECCOMP_MULTIPLEXED_SYSCALL_TABLE_MASK) >>
> >                                 SECCOMP_MULTIPLEXED_SYSCALL_TABLE_SHIFT;
> 
> This belongs over into patch 1.

Thanks! I was rushing to get this posted so YiFei Zhu wouldn't spend
time fighting with arch and Kconfig stuff. :) I'll clean this (and the
other random cruft) up.

-- 
Kees Cook


More information about the Containers mailing list