[PATCH review 11/11] mnt: Honor MNT_LOCKED when detaching mounts
viro at ZenIV.linux.org.uk
Thu Jan 8 00:22:27 UTC 2015
On Wed, Jan 07, 2015 at 03:51:17PM -0600, Eric W. Biederman wrote:
> I disagree with how you have described the current rules.
In which situation that description does not match what we do?
> Currently the reference count on a struct mount is:
> - incremented by 1 for every child listed in mnt_mounts.
Yes - their ->mnt_parent does it.
> - incremented by 1 for every oridinary user that does not know
> the intimate details of how mounts are implemented.
That'd be references other than ->mnt_parent.
> - incremented by 1 if the mount is on the mnt_list and is mounted.
== reachable from hashtable, except that you include the reference from
namespace to its root mount into that; in my description that goes as
> - Mounts that have been unmounted do not mounted do not have children.
> What I have introduced is state where a mount that is unmounted has
> children, and the connection between the mount and those children
> remain in the mount hash table.
> In this new case it becomes possible for a mount with children to reach
> the final mntput.
> In this new case an unmounted mount with children that reaches mntput
> will hold a reference to each of it's children until it reaches mntput,
> and there the reference will be dropped.
> To get into that state in umount_tree the code drops the reference to the
> parent from children as normal. The code repairs the child list which
> was torn down early. The code takes mnt_list and sets mnt_ns = NULL so
> the mount is clearly unmounted. The reference count from being mounted
> is given to the mount parent.
> At the parents final mntput the children are walked through and removed
> from the mount hash table, and in cleanup_mnt where we don't have an
> ultra deep stack and are not holding locks, the dentry is dput and
> and the child mount is mntput.
You are describing what your code does. Thank you, but I *can* read C just
as easy as your text (easier, TBH). What I'm asking for is data structures
invariants. As in, "outside of such and such locks, the following predicates
If I understand what you are saying, you have
* subset of mounts (ones that had MNT_LOCKED and went through
* mounts in that subset are
* present in their parents' lists of children
* do _NOT_ contribute to parents' refcounts
* do _NOT_ participate in event propagation
* removed from the lists of children and dropped upon
parent's final mntput (dropping being done on shallow stack, just before the
fs shutdown of parent).
I can buy that, *IF* one could add
* are guaranteed to be unreachable from any namespace root.
to the above. And yes, your logics around marking them is probably enough
for that. However, if we go that way, I don't understand why do we need
to involve MNT_LOCKED.
Look: your new rules for that subset are actually not that different from
the old ones - they *are* hashed, so by the old rules we'd added 1 to
refcount anyway. The only period when that is not true is from __detach_mnt()
in mntput_no_expire() to cleanup_mnt(). The real difference, AFAICS, is
that their ->mnt_parent does not contribute to refcount of its targets.
If the root of such a detached tree gets the last reference dropped,
it gets killed as usual *and* its children become roots of detached trees.
That is protected by mount hash lock (which prevents somebody inside a subtree
from traversing .. and grabbing a reference to root after we'd started to
detach stuff from it).
Fine, that makes sense, but IMO it would make a lot of sense to have a variant
of MNT_DETACH that would do that for *all* mounts, locked or not. I'd be
tempted to make MNT_DETACH itself do that, except that the current behaviour
(dropping eagerly) is useful in situations like "we have a stuck filesystem
and a bunch of stuff under it; let's do umount -l on that shit, hopefully
the non-busy filesystems mounted down there will get a clean shutdown out
I seriously dislike your use of path_put() on mnt_ex_mountpoint - it's
actively misleading. We bloody better _not_ have non-NULL
->mnt_ex_mountpoint.mnt in your cleanup_mnt(), for obvious reasons. So
it's just dput() + do something random if non-NULL mnt found.
Not sure if I like the use of mnt_child between mntput_no_expire() and
cleanup_mnt() - it's probably safe, but the fewer lists we modify outside of
mount hash lock, the better; hell knows, I'll need to stare at that code
a bit more. FWIW, AFAICS the refcount rules with your variant are
* external references countribute
* mnt_parent contributes unless it points to ourselves *or* mnt_ns is
* reachable from mount hash => add 1
* in addition to the wart around namespace_unlock(), we have a similar
wart between mntput_no_expire() and cleanup_mnt(), only there we thread the
suckers on mnt_child instead of mnt_list and use slightly different logics to
prevent shutdown of parent fs before the dput(mountpoint).
I really hate synchronize_rcu() in namespace_unlock() ;-/ Not your fault,
but it's a PITA every time one does analysis of all that code... It's
about legitimize_mnt() vs. umount() - we want to make sure that any
transient bumps of refcount from attempts to legitimize a lazy reference
to vfsmount getting killed by umount() will be undone *before* we get to
mntput_no_expire() in namespace_unlock(). It's only an issue for sync
umounts, so we are fine here, but it definitely needs commenting upon -
some vfsmounts *can* escape it with that approach, but they are not going
to be marked sync-umount, so we don't need to worry about them...
More information about the Containers