Regression wrt mounting /proc in user namespace in 3.13
Serge E. Hallyn
serge at hallyn.com
Sat Nov 16 16:48:40 UTC 2013
Quoting Daniel P. Berrange (berrange at redhat.com):
> Just testing libvirt with user namespaces on current Fedora rawhide
> 3.13.0-0.rc0.git3.2.fc21.x86_64 kernel, I'm now getting an error when
> we attempt to mount /proc
Thanks. I saw the same thing with 3.12 on Friday afternoon, and decided
I must be too haggard from a week of unrelated work to think straight.
This definitely will be a problem: it makes user namespaces unusable for
containers.
> # virsh -c lxc:/// start shell
> error: Failed to start domain shell
> error: internal error: guest failed to start: Failed to mount proc on /proc type proc flags=e: Operation not permitted
>
> The syscall failing is
>
> mount("proc", "/proc", "proc", MS_NOSUID|MS_NODEV|MS_NOEXEC, NULL) = -1 EPERM (Operation not permitted)
>
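The "flags=e" in libvirt's error message is the hexadecimal mount flag
word, and it matches the flags in the failing syscall above. A minimal
Python sketch that decodes it follows. It models only the four lowest
MS_* bits from <linux/fs.h>, and decode_mount_flags is a hypothetical
helper for illustration, not part of libvirt:

```python
# Mount flag values from the Linux UAPI header <linux/fs.h>.
# Only the four lowest bits are modelled in this sketch.
MS_RDONLY = 0x1
MS_NOSUID = 0x2
MS_NODEV = 0x4
MS_NOEXEC = 0x8

FLAG_NAMES = {
    MS_RDONLY: "MS_RDONLY",
    MS_NOSUID: "MS_NOSUID",
    MS_NODEV: "MS_NODEV",
    MS_NOEXEC: "MS_NOEXEC",
}


def decode_mount_flags(flags: int) -> list[str]:
    """Return the names of the known MS_* bits set in flags, lowest bit first.

    Bits not listed in FLAG_NAMES are silently ignored.
    """
    return [name for bit, name in sorted(FLAG_NAMES.items()) if flags & bit]


flags = int("e", 16)  # the "flags=e" value from the libvirt error message
print(hex(flags), "|".join(decode_mount_flags(flags)))
# prints: 0xe MS_NOSUID|MS_NODEV|MS_NOEXEC
```

So libvirt is requesting exactly nosuid,nodev,noexec, the same options
the host's /proc is mounted with; the EPERM does not come from an
unusual flag combination.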
>
> On the host OS the default Fedora environment has the following mounts
> present
>
> # grep /proc /proc/mounts
> proc /proc proc rw,nosuid,nodev,noexec,relatime 0 0
> systemd-1 /proc/sys/fs/binfmt_misc autofs rw,relatime,fd=41,pgrp=1,timeout=300,minproto=5,maxproto=5,direct 0 0
> binfmt_misc /proc/sys/fs/binfmt_misc binfmt_misc rw,relatime 0 0
> sunrpc /proc/fs/nfsd nfsd rw,relatime 0 0
>
> # ls /proc/fs/nfsd/
> export_features filehandle nfsv4gracetime nfsv4recoverydir pool_threads reply_cache_stats threads unlock_ip
> exports max_block_size nfsv4leasetime pool_stats portlist supported_krb5_enctypes unlock_filesystem versions
>
> # ls /proc/sys/fs/binfmt_misc/
> qemu-alpha qemu-cris qemu-microblazeel qemu-mips64el qemu-ppc64 qemu-sh4 qemu-sparc32plus status
> qemu-arm qemu-m68k qemu-mips qemu-mipsel qemu-ppc64abi32 qemu-sh4eb qemu-sparc64
> qemu-armeb qemu-microblaze qemu-mips64 qemu-ppc qemu-s390x qemu-sparc register
>
>
> Only if I umount both of the /proc/sys/fs/binfmt_misc/ entries
> am I able to get past this EPERM error code.
>
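To see which filesystems sit on top of the host's /proc, a rough
userspace sketch in Python can parse /proc/mounts-format text and list
the mounts located beneath a given mount point. It only inspects mount
paths and does not reproduce the kernel's visibility check. The helper
names are hypothetical, and for simplicity it ignores the octal escapes
(e.g. \040 for a space) that /proc/mounts uses in paths:

```python
from collections import Counter
from typing import NamedTuple

# Sample taken from the "grep /proc /proc/mounts" output quoted above.
SAMPLE = """\
proc /proc proc rw,nosuid,nodev,noexec,relatime 0 0
systemd-1 /proc/sys/fs/binfmt_misc autofs rw,relatime,fd=41,pgrp=1,timeout=300,minproto=5,maxproto=5,direct 0 0
binfmt_misc /proc/sys/fs/binfmt_misc binfmt_misc rw,relatime 0 0
sunrpc /proc/fs/nfsd nfsd rw,relatime 0 0
"""


class MountEntry(NamedTuple):
    source: str
    target: str
    fstype: str
    options: str


def parse_mounts(text: str) -> list[MountEntry]:
    """Parse the first four whitespace-separated fields of each line."""
    entries = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        entries.append(MountEntry(*fields[:4]))
    return entries


def mounts_under(entries: list[MountEntry], base: str) -> list[MountEntry]:
    """Return mounts strictly beneath base (base itself is excluded)."""
    prefix = base.rstrip("/") + "/"
    return [e for e in entries if e.target.startswith(prefix)]


def stacked_targets(entries: list[MountEntry]) -> dict[str, int]:
    """Return mount points that carry more than one mounted filesystem."""
    counts = Counter(e.target for e in entries)
    return {target: n for target, n in counts.items() if n > 1}


under = mounts_under(parse_mounts(SAMPLE), "/proc")
for e in under:
    print(e.target, e.fstype)
print(stacked_targets(under))
```

On the sample it lists three submounts of /proc and reports that
/proc/sys/fs/binfmt_misc carries two stacked filesystems (autofs, then
binfmt_misc), which are exactly the two entries that had to be unmounted
before the EPERM went away.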
> Looking at GIT history I see this change as a likely candidate for
> something which has changed in this area:
>
> commit e51db73532955dc5eaba4235e62b74b460709d5b
> Author: Eric W. Biederman <ebiederm at xmission.com>
> Date: Sat Mar 30 19:57:41 2013 -0700
>
> userns: Better restrictions on when proc and sysfs can be mounted
>
> Rely on the fact that another flavor of the filesystem is already
> mounted and do not rely on state in the user namespace.
>
> Verify that the mounted filesystem is not covered in any significant
> way. I would love to verify that the previously mounted filesystem
> has no mounts on top but there are at least the directories
> /proc/sys/fs/binfmt_misc and /sys/fs/cgroup/ that exist explicitly
> for other filesystems to mount on top of.
>
> Refactor the test into a function named fs_fully_visible and call that
> function from the mount routines of proc and sysfs. This makes this
> test local to the filesystems involved and the results current of when
> the mounts take place, removing a weird threading of the user
> namespace, the mount namespace and the filesystems themselves.
>
> Signed-off-by: "Eric W. Biederman" <ebiederm at xmission.com>
>
>
> My guess is fs_fully_visible() is returning false, and thus causing the
> proc_mount() call to return EPERM, but I'm unclear why this would happen,
> or if this is indeed a correct hypothesis.
>
>
> Regards,
> Daniel
> --
> |: http://berrange.com -o- http://www.flickr.com/photos/dberrange/ :|
> |: http://libvirt.org -o- http://virt-manager.org :|
> |: http://autobuild.org -o- http://search.cpan.org/~danberr/ :|
> |: http://entangle-photo.org -o- http://live.gnome.org/gtk-vnc :|