[CFT][PATCH 00/10] Making new mounts of proc and sysfs as safe as bind mounts (take 2)

Thu May 28 20:12:04 UTC 2015

On Thu, May 28, 2015 at 12:14 PM, Eric W. Biederman
<ebiederm at xmission.com> wrote:
> But please someone test sandstorm with this patchset and tell me if it
> bites you.  The impetus to find a way to avoid breaking slightly buggy
> userspace is higher if it is more than unprivileged lxc that is broken.

One of these days I'm going to learn how to compile and test kernels
again (last time I did it was 1999). Unfortunately I don't think I
have time at the moment, but hopefully Andy can do it.

I note, though, that we only have three mount() calls in the sandstorm
codebase that seem like they could be affected:

run-bundle.c++:1264: KJ_SYSCALL(mount("proc", "proc", "proc",
    MS_NOSUID | MS_NODEV | MS_NOEXEC, ""));
minibox.c++:251: KJ_SYSCALL(mount("proc", vpath.cStr(), "proc",
    MS_NOSUID | MS_NODEV | MS_NOEXEC, ""),
supervisor.c++:921: KJ_SYSCALL(mount("/proc", "proc", nullptr,
    MS_BIND | MS_REC, nullptr));

The first two seem like they should be fine since they set all the
flags (except readonly, which would be inappropriate for proc). I
guess my habit of setting every security flag I see came in handy. The
third case looks like it will be broken, BUT this line is in a
debug-only code path, so I don't care. Also we have the ability to
push any needed update within 24 hours, so we're generally in good
shape.
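
To make the reasoning about the first two calls concrete, here is a
rough sketch. It assumes the new rule amounts to "a fresh proc mount
must carry every restrictive flag that the already-visible proc mount
carries"; that is my reading of this thread, not a copy of the kernel's
check. The names kRestrictiveFlags, kSandstormProcFlags, missingFlags
and strictAllows are made up for illustration.

```cpp
#include <cassert>
#include <sys/mount.h>  // MS_RDONLY, MS_NOSUID, MS_NODEV, MS_NOEXEC (Linux)

// Hypothetical: the flags this sketch treats as "restrictive".
constexpr unsigned long kRestrictiveFlags =
    MS_RDONLY | MS_NOSUID | MS_NODEV | MS_NOEXEC;

// The flags passed by the two fresh proc mounts quoted above.
constexpr unsigned long kSandstormProcFlags =
    MS_NOSUID | MS_NODEV | MS_NOEXEC;

// Restrictive flags set on the existing mount but absent from the request.
constexpr unsigned long missingFlags(unsigned long existing,
                                     unsigned long requested) {
  return existing & kRestrictiveFlags & ~requested;
}

// Assumed strict rule: the request is allowed only if nothing is missing.
constexpr bool strictAllows(unsigned long existing, unsigned long requested) {
  return missingFlags(existing, requested) == 0;
}
```

Under this model, a request that sets nosuid, nodev and noexec passes
whenever the existing proc has any subset of those three flags, and
fails only if the existing proc is also read-only.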

We never mount sysfs in Sandstorm.

> If I mount proc read-write I likely want to be able to write to proc
> files, and I will be much happier if the mount fails than if a bazillion
> syscalls later something else fails when it tries to write to proc.

I'm not sure that's true. Consider the broader context:
1) Your system's /proc is mounted read-only.
2) Now you're trying to mount a new proc in a new pid namespace, and
you do *not* specify MS_RDONLY.

What should we expect here? Let's back off a bit and state user intent:
1) The system administrator has set a system-wide policy that /proc
may only be read, not written.
2) You made a PID namespace and it needed its own proc.

It seems intuitive here that the administrator's policy should apply
in the namespace. Certainly everyone using the system and/or all
software on the system already needs to be aware of this policy, since
it's unusual and will break things. Running software on this system
outside of any container already has the problem that syscalls
randomly break, so why should it be surprising when this happens
inside the container as well? Why do we need to go out of our way to
break at mount() time?
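
The alternative argued for above can be sketched as follows. This is
only a model of the idea "the new mount silently inherits the
administrator's restrictive flags instead of failing"; it is not
something the patchset implements, and kPolicyFlags and inheritedFlags
are hypothetical names.

```cpp
#include <cassert>
#include <sys/mount.h>  // MS_RDONLY, MS_NOSUID, MS_NODEV, MS_NOEXEC (Linux)

// Hypothetical: flags treated as administrator policy in this sketch.
constexpr unsigned long kPolicyFlags =
    MS_RDONLY | MS_NOSUID | MS_NODEV | MS_NOEXEC;

// Effective flags if the new mount inherits the existing mount's policy
// flags on top of whatever the caller requested.
constexpr unsigned long inheritedFlags(unsigned long existing,
                                       unsigned long requested) {
  return requested | (existing & kPolicyFlags);
}
```

With a read-only system /proc and a request that omits MS_RDONLY, the
result includes MS_RDONLY, so writes fail later inside the namespace in
the same way they already fail outside it.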

-Kenton