pivot_root(".", ".") and the fchdir() dance

Eric W. Biederman ebiederm at xmission.com
Mon Sep 30 11:42:30 UTC 2019


"Michael Kerrisk (man-pages)" <mtk.manpages at gmail.com> writes:

> Hello Eric,
>
> A ping on my question below. Could you take a look please?
>
> Thanks,
>
> Michael
>
>>>>> The concern from our conversation at the container mini-summit was that
>>>>> there is a pathology if in your initial mount namespace all of the
>>>>> mounts are marked MS_SHARED like systemd does (and is almost necessary
>>>>> if you are going to use mount propagation), that if new_root itself
>>>>> is MS_SHARED then unmounting the old_root could propagate.
>>>>>
>>>>> So I believe the desired sequence is:
>>>>>
>>>>>>>>            chdir(new_root);
>>>>> +++            mount("", ".", MS_SLAVE | MS_REC, NULL);
>>>>>>>>            pivot_root(".", ".");
>>>>>>>>            umount2(".", MNT_DETACH);
>>>>>
>>>>> The change to new_root could be either MS_SLAVE or MS_PRIVATE.  So
>>>>> long as it is not MS_SHARED the mount won't propagate back to the
>>>>> parent mount namespace.
>>>>
>>>> Thanks. I made that change.
>>>
>>> For what it is worth.  The sequence above without the change in mount
>>> attributes will fail if it is necessary to change the mount attributes
>>> as "." is both put_old as well as new_root.
>>>
>>> When I initially suggested the change I saw "." was new_root and forgot
>>> "." was also put_old.  So I thought there was a silent danger without
>>> that sequence.
>> 
>> So, now I am a little confused by the comments you added here. Do you
>> now mean that the 
>> 
>> mount("", ".", MS_SLAVE | MS_REC, NULL);
>> 
>> call is not actually necessary?

Apologies for being slow getting back to you.

To my knowledge there are two cases where pivot_root is used.
- In the initial mount namespace from a ramdisk when mounting root.
  This is the original use case and somewhat historical as rootfs
  (aka an initial ramfs) may not be unmounted.

- When setting up a new mount namespace to jettison all of the mounts
  you don't need.

The sequence:

        chdir(new_root);
        pivot_root(".", ".");
        umount2(".", MNT_DETACH);

is perfect for both use cases (as nothing needs to be known about the
directory layout of the new root filesystem).
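
As a rough illustration, the three calls could be wrapped in C as below.
This is only a sketch: sys_pivot_root() and switch_root_to() are names
made up here, glibc provides no pivot_root() wrapper so the call goes
through syscall(2), and the caller is assumed to already be in its own
mount namespace, to hold CAP_SYS_ADMIN, and to pass a new_root that is
a mount point.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <sys/mount.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical thin wrapper: glibc has no pivot_root() wrapper. */
static int sys_pivot_root(const char *new_root, const char *put_old)
{
    return (int)syscall(SYS_pivot_root, new_root, put_old);
}

/*
 * Hypothetical helper implementing the three-call sequence.
 * Returns 0 on success, or -1 with errno set by the failing call.
 */
static int switch_root_to(const char *new_root)
{
    /* Make new_root the current directory so "." can name it. */
    if (chdir(new_root) == -1)
        return -1;

    /* "." is both new_root and put_old. */
    if (sys_pivot_root(".", ".") == -1)
        return -1;

    /* Lazily detach the old root, which "." now refers to. */
    if (umount2(".", MNT_DETACH) == -1)
        return -1;

    return 0;
}
```

A caller would check the return value and report errno; nothing in the
helper depends on any directory existing inside new_root.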

In the case when you are setting up a new mount namespace, propagating
changes in the mount layout to another mount namespace is fatal.  But
that is not a concern for the pivot_root sequence above, because
pivot_root will fail deterministically if
'mount("", ".", MS_SLAVE | MS_REC, NULL)' is needed but not specified.

So I would document the above sequence of three system calls in the
man-page.

I would document that pivot_root will fail if propagation would occur.

I would document in pivot_root or under unshare(CLONE_NEWNS) that if
mount propagation is enabled (the default with systemd) you need to
call 'mount("", "/", MS_SLAVE | MS_REC, NULL);' or
'mount("", "/", MS_PRIVATE | MS_REC, NULL);' after creating a mount
namespace.  Otherwise mounts will propagate backwards, which is usually
not what people want.
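
A minimal sketch of that step in C, under some assumptions: the helper
name new_mount_ns_without_propagation() is invented for illustration,
and the caller is assumed to have CAP_SYS_ADMIN.  Note that the real
mount(2) takes five arguments, so the shorthand above gains a NULL
filesystem type here.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <sched.h>
#include <sys/mount.h>

/*
 * Hypothetical helper: create a new mount namespace and stop mount
 * events in it from propagating back to the parent namespace.
 * 'propagation' must be MS_SLAVE or MS_PRIVATE; anything else
 * (MS_SHARED in particular) is rejected with EINVAL before any
 * namespace is created.
 * Returns 0 on success, or -1 with errno set.
 */
static int new_mount_ns_without_propagation(unsigned long propagation)
{
    if (propagation != MS_SLAVE && propagation != MS_PRIVATE) {
        errno = EINVAL;
        return -1;
    }

    if (unshare(CLONE_NEWNS) == -1)
        return -1;

    /* Source, filesystem type and data are ignored for this call. */
    if (mount("", "/", NULL, propagation | MS_REC, NULL) == -1)
        return -1;

    return 0;
}
```

With MS_SLAVE the new namespace still receives mounts from the parent;
with MS_PRIVATE it does not.  Either way nothing propagates backwards.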

Creating a mount namespace in a user namespace automatically does
'mount("", "/", MS_SLAVE | MS_REC, NULL);' if the starting mount
namespace was not created in that user namespace.  AKA creating
a mount namespace in a user namespace does the unshare for you.
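
One way to observe this, sketched below, is to look for "shared:N" tags
in the optional fields of /proc/self/mountinfo after creating the
namespace.  The helper names are made up for illustration, and the
fixed line buffer is an assumption that could miscount on unusually
long lines.

```c
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: return 1 if a mountinfo line carries a
 * "shared:N" tag in its optional fields, which end at the " - "
 * separator; return 0 otherwise.
 */
static int mountinfo_line_is_shared(const char *line)
{
    const char *sep = strstr(line, " - ");
    const char *tag = strstr(line, " shared:");

    return tag != NULL && sep != NULL && tag < sep;
}

/*
 * Hypothetical helper: count the shared mounts visible to the
 * calling process.  Returns -1 if mountinfo cannot be opened.
 */
static int count_shared_mounts(void)
{
    FILE *fp = fopen("/proc/self/mountinfo", "r");
    char line[4096];
    int shared = 0;

    if (fp == NULL)
        return -1;

    while (fgets(line, sizeof(line), fp) != NULL)
        if (mountinfo_line_is_shared(line))
            shared++;

    fclose(fp);
    return shared;
}
```

Calling count_shared_mounts() before and after
unshare(CLONE_NEWUSER | CLONE_NEWNS) should show the difference.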

Eric

