[CFT][PATCH 00/10] Making new mounts of proc and sysfs as safe as bind mounts (take 2)

Thu May 28 19:14:20 UTC 2015

Kenton Varda <kenton at sandstorm.io> writes:

> On Thu, May 28, 2015 at 10:33 AM, Andy Lutomirski <luto at amacapital.net> wrote:
>> On Thu, May 28, 2015 at 8:03 AM, Eric W. Biederman
>> <ebiederm at xmission.com> wrote:
>>> Serge Hallyn <serge.hallyn at ubuntu.com> writes:
>>>
>>>> Quoting Andy Lutomirski (luto at amacapital.net):
>>>>> On Fri, May 22, 2015 at 10:39 AM, Eric W. Biederman
>>>>> <ebiederm at xmission.com> wrote:
>>>>> > I had hoped to get some Tested-By's on that patch series.
>>>>>
>>>>> Sorry, I've been totally swamped.
>>>>>
>>>>> I suspect that Sandstorm is okay, but I haven't had a chance to test
>>>>> it for real.  Sandstorm makes only limited use of proc and sysfs in
>>>>> containers, but I'll see if I can test it for real this weekend.
>>>>
>>>> Testing this with unprivileged containers, I get
>>>>
>>>> lxc-start: conf.c: lxc_mount_auto_mounts: 808 Operation not permitted
>>>> - error mounting sysfs on
>>>> /usr/lib/x86_64-linux-gnu/lxc/sys/devices/virtual/net flags 0
>>>
>>> Grr..  I was afraid this would break something. :(
>>>
>>> Looking at my system I see that sysfs is currently mounted
>>> "nosuid,nodev,noexec"
>>>
>>> Looking at the lxc-start code I don't see it as including any of those
>>> mount options.  In practice for sysfs I think those options are
>>> meaningless (as there should be no devices and nothing executable in
>>> sysfs) but I can understand the past concerns with chmod on virtual
>>> filesystems that would incline people to use them, so I think the
>>> failure is reporting a legitimate security issue in the lxc userspace
>>> code where the the unprivileged code is currently attempting to give
>>> greater access to sysfs than was given by the original mount of sysfs.
>>>
>>> As nosuid,nodev,noexec should not impair the operation of sysfs
>>> operation it looks like you can always specify those options and just
>>> make this concern go away.
>>
>> Linus is pretty strict about not breaking the ABI, and this definitely
>> counts as breaking the ABI.  There's an exception for security issues,
>> but is there really a security issue here?  That is, do we lose
>> anything important if we just drop the offending part of the patch
>> set?  As you've said, there shouldn't be sensitive device nodes,
>> executables, or setuid files in proc or sysfs in the first place.

We do need to enforce retaining the existing mount flags one way or
another.  Where this really matters is with MS_RDONLY.  We don't want
any old user to be able to mount /proc read-write when root mounted it
read-only.  There is a very real attack vector there.  That attack
almost works in docker container today and is avoided simply because
docker mounts over a few files on proc.

Which leads to the second side of the reason for these changes.   I am
fixing a very small but long standing ABI break.   That is in some small
ways I broke some sandboxes and when I realized they were broken I could
not imagine think how to fix the code until now.

It is the goal that user namespaces don't introduce anything for people
to worry about security wise more than simply the ability to execute
more kernel code.  So at least when the kernel implementation is correct
developers of existing applications simply do not need care.  Sadly we are
not quite there yet.

> Speaking as a user of the mount() interfaces, I really think it would
> be less confusing overall if mount() simply ignored the requested
> flags when the caller doesn't have a choice. That is, in cases where
> mount() currently fails with EPERM when not given, say, MS_NOSUID, it
> should instead just pretend the caller actually set MS_NOSUID and go
> ahead with a nosuid mount. Or put another way, the absence of
> MS_NOSUID should not be interpreted as "remove the nosuid bit" but
> rather "don't set the nosuid bit if not required".

I am conflicted.  Implicits are nice but confusing.  If we can do
something reliable and robust and maintainable here that is truly worth
the cost I am all for it.

If I mount proc read-write I likely want to be able to write to proc
files, and I will be much happier if the mount fails than if a bazillion
syscalls later something else fails when it tries to write to proc.

> Consider:
>
> - This approach will actually cause lxc to have the correct behavior,
> without any changes to lxc. I suspect that this generalizes: In the
> vast majority of cases, when users have failed to set MS_NOSUID, it's
> not because they are explicitly requesting that the flag be turned
> off, but rather that they didn't know it mattered.
>
> - If a user actually *does* expect not passing MS_NOSUID to remove the
> nosuid bit, and they find instead that the nosuid bit is silently
> kept, I don't think they'll be confused: it's pretty obvious in
> context that this must be for security reasons.
>
> - On the other hand, the current behavior *is* very confusing: mount()
> returns EPERM because of rules the caller probably doesn't know
> anything about. I've spent a fair amount of time frustrated by this
> sort of thing.

My sympathies.  This all started with an oh crap we overlooked corner
case X and it actually matters, and the fixes were quite likely a little
bit hasty.  The only case where this really shows up is remount insode
of a user namespace of filesystems that were mounted outside of the user
namespace is where this all actually matters.  And mounting new
instances of proc and sysfs wind up being weird instances of that
nonsense.

But please someone test sandstorm with this patchset and tell me if it
bites you.  The impetus to find a way to avoid breaking slightly buggy
userspace is higher if it is more than unprivileged lxc that is broken.

Eric