shiftfs status and future development

Mon Jun 18 13:40:32 UTC 2018

On Fri, Jun 15, 2018 at 08:03:05PM -0700, James Bottomley wrote:
> On Fri, 2018-06-15 at 09:59 -0500, Seth Forshee wrote:
> > On Fri, Jun 15, 2018 at 08:56:38AM -0500, Serge E. Hallyn wrote:
> > > Quoting Seth Forshee (seth.forshee at canonical.com):
> > > > I wanted to inquire about the current status of shiftfs and the
> > > > plans for it moving forward. We'd like to have this functionality
> > > > available for use in lxd, and I'm interested in helping with
> > > > development (or picking up development if it's stalled).
> > > > 
> > > > To start, is anyone still working on shiftfs or similar
> > > > functionality? I haven't found it in any git tree on kernel.org,
> > > > and as far as mailing list activity the last submission I can
> > > > find is [1]. Is there anything newer than this?
> > > > 
> > > > Based on past mailing list discussions, it seems like there was
> > > > still debate as to whether this feature should be an overlay
> > > > filesystem or something supported at the vfs level. Was this ever
> > > > resolved?
> > > > 
> > > > Thanks,
> > > > Seth
> > > > 
> > > > [1] http://lkml.kernel.org/r/1487638025.2337.49.camel@HansenPartnership.com
> > > 
> > > Hey Seth,
> > > 
> > > I haven't heard anything in a long time.  But if this is going to
> > > pick back up, can we come up with a detailed set of goals and
> > > requirements?
> 
> That would actually help.
> 
> > I was planning to follow up later with some discussion of
> > requirements.
> > Here are some of ours:
> > 
> >  - Supports any id maps possible for a user namespace
> 
> Could you clarify: right at the moment, it basically reverses the
> namespace ID mapping when it goes on to the filesystem using the
> superblock user namespace, so, in theory you can have an arbitrary
> mapping simply by changing the s_user_ns.  The problem here is that you
> don't have a lot of tools for manipulating the s_user_ns.

For our purposes the way you're shifting with s_user_ns works fine. I
know that Serge would prefer a more arbitrary shift so that an
arbitrary, unprivileged range in the source fs could be used (e.g. use
ids 100000 - 101000 in the source instead of 0 - 1000), and my thoughts
on that are quoted below.
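
To make the two cases concrete, here is a minimal userspace sketch of the
id arithmetic. It is a toy model and not shiftfs code: struct id_extent,
map_inner_to_outer(), map_outer_to_inner() and shift_id() are hypothetical
names, and the extents only mimic the (inside, outside, count) triples of
a uid_map line.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define INVALID_ID ((uint32_t)-1)

/* One extent of an id map, modelled on a line of /proc/<pid>/uid_map. */
struct id_extent {
	uint32_t inner;	/* first id as seen inside the namespace */
	uint32_t outer;	/* first id as seen outside the namespace */
	uint32_t count;	/* number of consecutive ids covered */
};

/* Translate an id seen inside the namespace to the id outside it. */
static uint32_t map_inner_to_outer(const struct id_extent *map, size_t n,
				   uint32_t id)
{
	assert(map != NULL);
	for (size_t i = 0; i < n; i++) {
		if (id >= map[i].inner && id - map[i].inner < map[i].count)
			return map[i].outer + (id - map[i].inner);
	}
	return INVALID_ID;	/* id is not mapped */
}

/* Translate an id seen outside the namespace to the id inside it. */
static uint32_t map_outer_to_inner(const struct id_extent *map, size_t n,
				   uint32_t id)
{
	assert(map != NULL);
	for (size_t i = 0; i < n; i++) {
		if (id >= map[i].outer && id - map[i].outer < map[i].count)
			return map[i].inner + (id - map[i].outer);
	}
	return INVALID_ID;	/* id is not mapped */
}

/*
 * Shift an id stored in the source tree: first make it relative to the
 * source map, then express it through the target namespace's map.
 * With an identity source map this reduces to a shift based only on
 * the target namespace.
 */
static uint32_t shift_id(const struct id_extent *src, size_t nsrc,
			 const struct id_extent *dst, size_t ndst,
			 uint32_t disk_id)
{
	uint32_t rel = map_outer_to_inner(src, nsrc, disk_id);

	if (rel == INVALID_ID)
		return INVALID_ID;
	return map_inner_to_outer(dst, ndst, rel);
}
```

With a source map of { 0, 100000, 1000 } and a target map of
{ 0, 200000, 65536 } (both made-up values), an on-disk id of 100005
becomes relative id 5 and then 200005, while an on-disk id outside the
source range is reported as unmapped.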

> >  - Does not break inotify
> 
> I don't expect it does, but I haven't checked.

I haven't checked either; I'm planning to do so soon. This is a concern
that was expressed to me by others, I think because inotify doesn't work
with overlayfs.

> >  - Passes accurate disk usage and source information from the
> > "underlay"
> 
> mounts of this type don't currently show up in df
> 
> >  - Works with a variety of filesystems (ext4, xfs, btrfs, etc.)
> 
> yes
> 
> >  - Works with nested containers
> 
> yes

I'd say not so much:

        /* to mark a mount point, must be real root */
        if (ssi->mark && !capable(CAP_SYS_ADMIN))
                goto out;

So within a container I cannot mark a range to be shiftfs-mountable
within a container I create. I'd argue that as long as a user has
CAP_SYS_ADMIN towards sb->s_user_ns for the source filesystem it should
be safe to allow this as it implies privileges wrt all ids found in the
source mount. This will likely lead to stacked shiftfs mounts, not sure
yet whether or not this works in the current code.
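
The difference between the two checks can be illustrated with a small
userspace model. This is a simplified sketch of how a namespace-relative
capability check behaves, not kernel code: struct toy_userns, struct
toy_cred and the may_mark_*() helpers are hypothetical, only
CAP_SYS_ADMIN is modelled, and all ids are host-side values.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct toy_userns {
	const struct toy_userns *parent;	/* NULL for the initial ns */
	int level;				/* 0 for the initial ns */
	unsigned int owner;			/* id of the ns creator */
};

struct toy_cred {
	const struct toy_userns *ns;	/* namespace the task lives in */
	unsigned int euid;
	bool cap_sys_admin;		/* CAP_SYS_ADMIN within cred->ns */
};

/* Simplified model of a capability check relative to `target`. */
static bool toy_ns_capable(const struct toy_cred *cred,
			   const struct toy_userns *target)
{
	assert(cred != NULL && target != NULL);
	for (const struct toy_userns *ns = target; ns; ns = ns->parent) {
		/* Same namespace: the task's own capability decides. */
		if (ns == cred->ns)
			return cred->cap_sys_admin;
		/* The task cannot be in an ancestor of ns: no privilege. */
		if (ns->level <= cred->ns->level)
			return false;
		/* The creator of a child namespace is privileged in it. */
		if (ns->parent == cred->ns && ns->owner == cred->euid)
			return true;
	}
	return false;
}

/* Models the existing check: privilege in the initial namespace. */
static bool may_mark_current(const struct toy_cred *cred,
			     const struct toy_userns *init_ns)
{
	return toy_ns_capable(cred, init_ns);
}

/* Models the suggested check: privilege towards the source s_user_ns. */
static bool may_mark_proposed(const struct toy_cred *cred,
			      const struct toy_userns *sb_user_ns)
{
	return toy_ns_capable(cred, sb_user_ns);
}
```

In this model, root inside a container fails may_mark_current() but
passes may_mark_proposed() for a source filesystem whose s_user_ns is
the container's own namespace, and still fails for one whose s_user_ns
is the initial namespace.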

> > I'm also interested in collecting any requirements others might have.
> > 
> > > I don't recall whether the last version still worked like this, but
> > > I'm still not comfortable with the idea of a system where after a
> > > reboot, container-created root-owned files are owned by host root
> > > until a path is specially marked.  Enforcing that the "source"
> > > directory is itself uid-shifted would greatly ease my mind.
> 
> And I believe we're discussing everything below in a different
> subthread.
> 
> James
> 
> 
> > I understand the concern and share the discomfort to some degree, but
> > I'm not convinced that requiring the source subtree be shifted is the
> > right approach.
> > 
> > First, let's address the marking question. As you stated, an approach
> > that leaves the subtree unmarked for a period of time is problematic,
> > and imo this is a fatal flaw with marking as a protection for e.g.
> > execing some suid root file written by a container. Writing some such
> > mark to the filesystem would make it persistent, but it could also
> > limit the support to a limited set of filesystems.
> > 
> > However, I do think it's necessary for a user with sufficient
> > capabilities to "bless" a subtree for mounting in a less privileged
> > context, so this is a feature of marking that I would like to keep. I
> > think the new mount apis in David Howells' filesystem context patches
> > [1] might give us a nicer way to do this. For example, root in
> > init_user_ns could set up a mount fd which specifies the source
> > subtree for the id shift. At that time the kernel could check for
> > ns_capable(sb->s_user_ns, CAP_SYS_ADMIN) for the filesystem
> > containing the source subtree. Then the fd could be passed to a
> > container in a user namespace, who could use it to attach the mount
> > to its filesystem tree.  The same concept could be extended to nested
> > containers, as long as the user setting the source subtree has
> > CAP_SYS_ADMIN towards sb->s_user_ns for the subtree.
> > 
> > Now back to requiring the source subtree be id shifted. I understand
> > the motivation for wanting this, but I'm not sure I'm in favor of it.
> > To start, there are other ways to ensure that id shifted mounts don't
> > lead to problems, such as putting the subtree under a directory
> > accessible only by root or putting it in a nosuid or noexec mount.
> > For some implementations those sorts of protections are going to make
> > sense.
> > 
> > Having this requirement may also add significant time to mounting, as
> > I assume it would involve iterating through all filesystem objects.
> > 
> > Additionally, that requirement is likely to significantly complicate
> > the implementation. The simplest implementation would just translate
> > the k[ug]ids in the inodes to a target user ns. A slightly more
> > complicated approach might translate them based on a source and
> > destination user ns. If it's implemented based on passing in an
> > arbitrary id map at mount time it will be more complex and duplicate
> > functionality that user namespaces already give us.
> > 
> > Thanks,
> > Seth
> > 
> > [1] http://lkml.kernel.org/r/152720672288.9073.9868393448836301272.stgit@warthog.procyon.org.uk