[PATCH review 11/11] mnt: Honor MNT_LOCKED when detaching mounts

Al Viro viro at ZenIV.linux.org.uk
Thu Jan 8 00:22:27 UTC 2015

On Wed, Jan 07, 2015 at 03:51:17PM -0600, Eric W. Biederman wrote:

> I disagree with how you have described the current rules.

In which situation that description does not match what we do?

> Currently the reference count on a struct mount is:
> - incremented by 1 for every child listed in mnt_mounts.

Yes - their ->mnt_parent does it.

> - incremented by 1 for every oridinary user that does not know
>   the intimate details of how mounts are implemented.

That'd be references other than ->mnt_parent.

> - incremented by 1 if the mount is on the mnt_list and is mounted.

== reachable from hashtable, except that you include the reference from
namespace to its root mount into that; in my description that goes as
external reference.

> - Mounts that have been unmounted do not mounted do not have children.


> What I have introduced is state where a mount that is unmounted has
> children, and the connection between the mount and those children
> remain in the mount hash table.
> In this new case it becomes possible for a mount with children to reach
> the final mntput.

> In this new case an unmounted mount with children that reaches mntput
> will hold a reference to each of it's children until it reaches mntput,
> and there the reference will be dropped.
> To get into that state in umount_tree the code drops the reference to the
> parent from children as normal.  The code repairs the child list which
> was torn down early.  The code takes mnt_list and sets mnt_ns = NULL so
> the mount is clearly unmounted.  The reference count from being mounted
> is given to the mount parent.
> At the parents final mntput the children are walked through and removed
> from the mount hash table, and in cleanup_mnt where we don't have an
> ultra deep stack and are not holding locks, the dentry is dput and
> and the child mount is mntput.

You are describing what your code does.  Thank you, but I *can* read C just
as easy as your text (easier, TBH).  What I'm asking for is data structures
invariants.  As in, "outside of such and such locks, the following predicates
are true"...

If I understand what you are saying, you have
	* subset of mounts (ones that had MNT_LOCKED and went through
MNT_DETACH umount).
	* mounts in that subset are
		* hashed
		* present in their parents' lists of children
		* do _NOT_ contribute to parents' refcounts
		* do _NOT_ participate in event propagation
		* removed from the lists of children and dropped upon
parent's final mntput (dropping being done on shallow stack, just before the
fs shutdown of parent).

I can buy that, *IF* one could add
		* are guaranteed to be unreachable from any namespace root.
to the above.  And yes, your logics around marking them is probably enough
for that.  However, if we go that way, I don't understand why do we need
to involve MNT_LOCKED.

Look: your new rules for that subset are actually not that different from
the old ones - they *are* hashed, so by the old rules we'd added 1 to
refcount anyway.  The only period when that is not true is from __detach_mnt()
in mntput_no_expire() to cleanup_mnt().  The real difference, AFAICS, is
that their ->mnt_parent does not contribute to refcount of its targets.

If the root of such a detached tree gets the last reference dropped,
it gets killed as usual *and* its children become roots of detached trees.
That is protected by mount hash lock (which prevents somebody inside a subtree
from traversing .. and grabbing a reference to root after we'd started to
detach stuff from it).

Fine, that makes sense, but IMO it would make a lot of sense to have a variant
of MNT_DETACH that would do that for *all* mounts, locked or not.  I'd be
tempted to make MNT_DETACH itself do that, except that the current behaviour
(dropping eagerly) is useful in situations like "we have a stuck filesystem
and a bunch of stuff under it; let's do umount -l on that shit, hopefully
the non-busy filesystems mounted down there will get a clean shutdown out
of that".

I seriously dislike your use of path_put() on mnt_ex_mountpoint - it's
actively misleading.  We bloody better _not_ have non-NULL
->mnt_ex_mountpoint.mnt in your cleanup_mnt(), for obvious reasons.  So
it's just dput() + do something random if non-NULL mnt found.

Not sure if I like the use of mnt_child between mntput_no_expire() and
cleanup_mnt() - it's probably safe, but the fewer lists we modify outside of
mount hash lock, the better; hell knows, I'll need to stare at that code
a bit more.  FWIW, AFAICS the refcount rules with your variant are
	* external references countribute
	* mnt_parent contributes unless it points to ourselves *or* mnt_ns is
	* reachable from mount hash => add 1
	* in addition to the wart around namespace_unlock(), we have a similar
wart between mntput_no_expire() and cleanup_mnt(), only there we thread the
suckers on mnt_child instead of mnt_list and use slightly different logics to
prevent shutdown of parent fs before the dput(mountpoint).

I really hate synchronize_rcu() in namespace_unlock() ;-/  Not your fault,
but it's a PITA every time one does analysis of all that code...  It's
about legitimize_mnt() vs. umount() - we want to make sure that any
transient bumps of refcount from attempts to legitimize a lazy reference
to vfsmount getting killed by umount() will be undone *before* we get to
mntput_no_expire() in namespace_unlock().  It's only an issue for sync
umounts, so we are fine here, but it definitely needs commenting upon -
some vfsmounts *can* escape it with that approach, but they are not going
to be marked sync-umount, so we don't need to worry about them...

More information about the Containers mailing list