[RFC PATCH 6/6] shiftfs: support nested shiftfs mounts

Amir Goldstein amir73il at gmail.com
Fri Nov 2 10:02:45 UTC 2018


On Thu, Nov 1, 2018 at 11:49 PM Seth Forshee <seth.forshee at canonical.com> wrote:
>
> shiftfs mounts cannot be nested for two reasons -- global
> CAP_SYS_ADMIN is required to set up a mark mount, and a single
> functional shiftfs mount meets the filesystem stacking depth
> limit.
>
> The CAP_SYS_ADMIN requirement can be relaxed. All of the kernel
> ids in a mount must be within that mount's s_user_ns, so all that
> is needed is CAP_SYS_ADMIN within that s_user_ns.
>
> The stack depth issue can be worked around with a couple of
> adjustments. First, a mark mount doesn't really need to count
> against the stacking depth as it doesn't contribute to the call
> stack depth during filesystem operations. Therefore the mount
> over the mark mount only needs to count as one more than the
> lower filesystems stack depth.

That's true, but it also highlights the point that the "mark" sb is
completely unneeded and it really is just a nice little hack.
All the information it really stores is a lower mount reference,
a lower root dentry and a declarative statement "I am shiftable!".

Come to think about it, "container shiftable" really isn't that different from
NFS export with "no_root_squash" and auto mounted USB drive.
I mean the shifting itself is different of course, but the
declaration, not so much.
If I am allowing sudoers on another machine to mess with root owned
files visible
on my machine, I am pretty much have the same issues as container admins
accessing root owned files on my init_user_ns filesystem. In all those cases,
I'd better not be executing suid binaries from the untrusted "external" source.

Instead of mounting a dummy filesystem to make the declaration, you could
get the same thing with:
   mount(path, path, "none", MS_BIND | MS_EXTERN | MS_NOEXEC)
and all you need to do is add MS_EXTERN (MS_SHIFTABLE MS_UNTRUSTED
or whatnot)  constant to uapi and manage to come up good man page description.

Then users could actually mount a filesystem in init_user_ns MS_EXTERN and
avoid the extra bind mount step (for a full filesystem tree export).
Declaring a mounted image MS_EXTERN has merits on its own even without
containers and shitfs, for example with pluggable storage. Other LSMs could make
good use of that declaration.

>
> Second, when the lower mount is shiftfs we can just skip down to
> that mount's lower filesystem and shift ids relative to that.
> There is no reason to shift ids twice, and the lower path has
> already been marked safe for id shifting by a user privileged
> towards all ids in that mount's user ns.
>
> Signed-off-by: Seth Forshee <seth.forshee at canonical.com>
> ---
>  fs/shiftfs.c | 68 +++++++++++++++++++++++++++++++++++-----------------
>  1 file changed, 46 insertions(+), 22 deletions(-)
>
> diff --git a/fs/shiftfs.c b/fs/shiftfs.c
> index b19af7b2fe75..008ace2842b9 100644
> --- a/fs/shiftfs.c
> +++ b/fs/shiftfs.c
> @@ -930,7 +930,7 @@ static int shiftfs_fill_super(struct super_block *sb, void *raw_data,
>         struct shiftfs_data *data = raw_data;
>         char *name = kstrdup(data->path, GFP_KERNEL);
>         int err = -ENOMEM;
> -       struct shiftfs_super_info *ssi = NULL;
> +       struct shiftfs_super_info *ssi = NULL, *mp_ssi;
>         struct path path;
>         struct dentry *dentry;
>
> @@ -946,11 +946,7 @@ static int shiftfs_fill_super(struct super_block *sb, void *raw_data,
>         if (err)
>                 goto out;
>
> -       /* to mark a mount point, must be real root */
> -       if (ssi->mark && !capable(CAP_SYS_ADMIN))
> -               goto out;
> -
> -       /* else to mount a mark, must be userns admin */
> +       /* to mount a mark, must be userns admin */
>         if (!ssi->mark && !ns_capable(current_user_ns(), CAP_SYS_ADMIN))
>                 goto out;

Isn't this check performed by vfs anyway? i.e. in mount_nodev() -> sget()

>
> @@ -962,41 +958,66 @@ static int shiftfs_fill_super(struct super_block *sb, void *raw_data,
>
>         if (!S_ISDIR(path.dentry->d_inode->i_mode)) {
>                 err = -ENOTDIR;
> -               goto out_put;
> -       }
> -
> -       sb->s_stack_depth = path.dentry->d_sb->s_stack_depth + 1;
> -       if (sb->s_stack_depth > FILESYSTEM_MAX_STACK_DEPTH) {
> -               printk(KERN_ERR "shiftfs: maximum stacking depth exceeded\n");
> -               err = -EINVAL;
> -               goto out_put;
> +               goto out_put_path;
>         }
>
>         if (ssi->mark) {
> +               struct super_block *lower_sb = path.mnt->mnt_sb;
> +
> +               /* to mark a mount point, must root wrt lower s_user_ns */
> +               if (!ns_capable(lower_sb->s_user_ns, CAP_SYS_ADMIN))
> +                       goto out_put_path;
> +
> +
>                 /*
>                  * this part is visible unshifted, so make sure no
>                  * executables that could be used to give suid
>                  * privileges
>                  */
>                 sb->s_iflags = SB_I_NOEXEC;

As commented on cover letter, why allow access to any files besides root at all?
In fact, the only justification for a dummy sb (instead of bind mount with
MS_EXTERN flag) would be in order to override inode operations with noop ops
to prevent access to unshifted files from within container.

Thanks,
Amir.


More information about the Containers mailing list