[RFC PATCH 6/6] shiftfs: support nested shiftfs mounts

Fri Nov 2 10:02:45 UTC 2018

On Thu, Nov 1, 2018 at 11:49 PM Seth Forshee <seth.forshee at canonical.com> wrote:
>
> shiftfs mounts cannot be nested for two reasons -- global
> CAP_SYS_ADMIN is required to set up a mark mount, and a single
> functional shiftfs mount meets the filesystem stacking depth
> limit.
>
> The CAP_SYS_ADMIN requirement can be relaxed. All of the kernel
> ids in a mount must be within that mount's s_user_ns, so all that
> is needed is CAP_SYS_ADMIN within that s_user_ns.
>
> The stack depth issue can be worked around with a couple of
> adjustments. First, a mark mount doesn't really need to count
> against the stacking depth as it doesn't contribute to the call
> stack depth during filesystem operations. Therefore the mount
> over the mark mount only needs to count as one more than the
> lower filesystems stack depth.

That's true, but it also highlights the point that the "mark" sb is
completely unneeded and it really is just a nice little hack.
All the information it really stores is a lower mount reference,
a lower root dentry and a declarative statement "I am shiftable!".

Come to think about it, "container shiftable" really isn't that different from
NFS export with "no_root_squash" and auto mounted USB drive.
I mean the shifting itself is different of course, but the
declaration, not so much.
If I am allowing sudoers on another machine to mess with root owned
files visible
on my machine, I am pretty much have the same issues as container admins
accessing root owned files on my init_user_ns filesystem. In all those cases,
I'd better not be executing suid binaries from the untrusted "external" source.

Instead of mounting a dummy filesystem to make the declaration, you could
get the same thing with:
   mount(path, path, "none", MS_BIND | MS_EXTERN | MS_NOEXEC)
and all you need to do is add MS_EXTERN (MS_SHIFTABLE MS_UNTRUSTED
or whatnot)  constant to uapi and manage to come up good man page description.

Then users could actually mount a filesystem in init_user_ns MS_EXTERN and
avoid the extra bind mount step (for a full filesystem tree export).
Declaring a mounted image MS_EXTERN has merits on its own even without
containers and shitfs, for example with pluggable storage. Other LSMs could make
good use of that declaration.

>
> Second, when the lower mount is shiftfs we can just skip down to
> that mount's lower filesystem and shift ids relative to that.
> There is no reason to shift ids twice, and the lower path has
> already been marked safe for id shifting by a user privileged
> towards all ids in that mount's user ns.
>
> Signed-off-by: Seth Forshee <seth.forshee at canonical.com>
> ---
>  fs/shiftfs.c | 68 +++++++++++++++++++++++++++++++++++-----------------
>  1 file changed, 46 insertions(+), 22 deletions(-)
>
> diff --git a/fs/shiftfs.c b/fs/shiftfs.c
> index b19af7b2fe75..008ace2842b9 100644
> --- a/fs/shiftfs.c
> +++ b/fs/shiftfs.c
> @@ -930,7 +930,7 @@ static int shiftfs_fill_super(struct super_block *sb, void *raw_data,
>         struct shiftfs_data *data = raw_data;
>         char *name = kstrdup(data->path, GFP_KERNEL);
>         int err = -ENOMEM;
> -       struct shiftfs_super_info *ssi = NULL;
> +       struct shiftfs_super_info *ssi = NULL, *mp_ssi;
>         struct path path;
>         struct dentry *dentry;
>
> @@ -946,11 +946,7 @@ static int shiftfs_fill_super(struct super_block *sb, void *raw_data,
>         if (err)
>                 goto out;
>
> -       /* to mark a mount point, must be real root */
> -       if (ssi->mark && !capable(CAP_SYS_ADMIN))
> -               goto out;
> -
> -       /* else to mount a mark, must be userns admin */
> +       /* to mount a mark, must be userns admin */
>         if (!ssi->mark && !ns_capable(current_user_ns(), CAP_SYS_ADMIN))
>                 goto out;

Isn't this check performed by vfs anyway? i.e. in mount_nodev() -> sget()

>
> @@ -962,41 +958,66 @@ static int shiftfs_fill_super(struct super_block *sb, void *raw_data,
>
>         if (!S_ISDIR(path.dentry->d_inode->i_mode)) {
>                 err = -ENOTDIR;
> -               goto out_put;
> -       }
> -
> -       sb->s_stack_depth = path.dentry->d_sb->s_stack_depth + 1;
> -       if (sb->s_stack_depth > FILESYSTEM_MAX_STACK_DEPTH) {
> -               printk(KERN_ERR "shiftfs: maximum stacking depth exceeded\n");
> -               err = -EINVAL;
> -               goto out_put;
> +               goto out_put_path;
>         }
>
>         if (ssi->mark) {
> +               struct super_block *lower_sb = path.mnt->mnt_sb;
> +
> +               /* to mark a mount point, must root wrt lower s_user_ns */
> +               if (!ns_capable(lower_sb->s_user_ns, CAP_SYS_ADMIN))
> +                       goto out_put_path;
> +
> +
>                 /*
>                  * this part is visible unshifted, so make sure no
>                  * executables that could be used to give suid
>                  * privileges
>                  */
>                 sb->s_iflags = SB_I_NOEXEC;

As commented on cover letter, why allow access to any files besides root at all?
In fact, the only justification for a dummy sb (instead of bind mount with
MS_EXTERN flag) would be in order to override inode operations with noop ops
to prevent access to unshifted files from within container.

Thanks,
Amir.