[PATCH RESEND 2/5] seccomp: Add wait_killable semantic to seccomp user notifier
tycho at tycho.pizza
Wed Apr 28 00:22:15 UTC 2021
On Tue, Apr 27, 2021 at 04:19:54PM -0700, Andy Lutomirski wrote:
> On Tue, Apr 27, 2021 at 3:10 PM Sargun Dhillon <sargun at sargun.me> wrote:
> > On Tue, Apr 27, 2021 at 11:07:53AM -0600, Tycho Andersen wrote:
> > > On Tue, Apr 27, 2021 at 09:23:42AM -0700, Andy Lutomirski wrote:
> > > > On Tue, Apr 27, 2021 at 6:48 AM Tycho Andersen <tycho at tycho.pizza> wrote:
> > > > >
> > > > > On Mon, Apr 26, 2021 at 10:15:28PM +0000, Sargun Dhillon wrote:
> > > >
> > > > ISTM the current behavior is severely broken, and the new behavior
> > > > isn't *that* much better since it simply ignores signals and can't
> > > > emulate -EINTR (or all the various restart modes, sigh). Surely the
> > > > right behavior is to have the seccomped process notice that it got a
> > > > signal and inform the monitor of that fact so that the monitor can
> > > > take appropriate action.
> > >
> > > This doesn't help your case (2) though, since the IO could be done
> > > before the supervisor gets the notification.
> Tycho, I disagree. Here's how native syscalls work:
> 1. Entry work is done and the syscall hander does whatever it does at
> the beginning of the function. This is entirely non-interruptible.
> 2. The syscall handler decides it wants to wait, interruptibly,
> killably or otherwise.
> 3. It gets signaled. It takes appropriate action. Appropriate action
> does *not* mean -EINTR. It means that something that is correct *for
> that syscall* happens. For nanosleep(), this involves the restart
> block (and I don't think we need to support the restart block). For
> accept(), it mostly seems to mean that the syscall completes as usual.
> For write(2), it means that, depending on file type and whether any IO
> has occured, either -EINTR is returned and no IO happens, or fewer
> bytes than requested are transferred, or the syscall completes. (Or,
> if it's a KILL, the process dies early and partial IO is ignored.)
> For some syscalls (some AF_UNIX writes, for example, or ptrace()), the
> syscall indeed gets interrupted, but it uses one of the -ERESTART
> User notifiers should allow correct emulation. Right now, it doesn't,
> but there is no reason it can't.
Thanks for the explanation.
Consider fsmount, which has a,
ret = mutex_lock_interruptible(&fc->uapi_mutex);
if (ret < 0)
If a regular task is interrupted during that wait, it return -EINTR
or whatever back to userspace.
Suppose that we intercept fsmount. The supervisor decides the mount is
OK, does the fsmount, injects the mount fd into the container, and
then the tracee receives a signal. At this point, the mount fd is
visible inside the container. The supervisor gets a notification about
the signal and revokes the mount fd, but there was some time where it
was exposed in the container, whereas with the interrupt in the native
syscall there was never any exposure.
How does the supervisor correctly emulate this syscall? Maybe this
doesn't matter and this is just "completes as usual"? But then
handling signals is just a latency issue, not a correctness one. Or
most likely I'm misunderstanding something else :)
More information about the Containers