[PATCH review 4/6] mnt: Track when a directory escapes a bind mount
Eric W. Biederman
ebiederm at xmission.com
Fri Aug 14 04:10:26 UTC 2015
Al Viro <viro at ZenIV.linux.org.uk> writes:
> On Mon, Aug 03, 2015 at 04:27:34PM -0500, Eric W. Biederman wrote:
>> - The escape count on struct mount must be incremented both before the
>> rename and after. If the count is not incremented before the rename
>> it is possible to hit a scenario where the rename happens the code
>> walks up the directory tree to somewhere outside of the bind mount
>> before the count is touched. Similary without a count after the
>> rename it is possible for the code to look at the escape count
>> validate a path is connected before the rename and assume cache the
>> escape count, leading to not retesting the path is ok.
> Umm... I wonder if you are overcomplicating the things here. Sure,
> I understand wanting to reduce the checks on "..", but... It costs you
> considerable complexity (especially when it comes to 64bit counts),
> it's really brittle (you need to be very careful about the places where
> you zero the cached values in fs/namei.c and missing one will lead to
> really unpleasant effects there) _and_ it's all for the benefit of
> a very rare case. With check you are optimizing away not being all that
> costly anyway.
I had to give this a long hard think. Algorithms going to O(N^2) when
it is uncessarry really bother me. I ran some numbers for really deep
directory trees, slow memory, etc and I could not come up with a
scenario where even in it's worst case d_ancestor would take anywhere
near as long as a one disk seek, and most of the d_ancestor would be
So it appears to me that in the worst case a pathname lookup consisting
of a ridiculous number of .. components starting with a cold cache, on a
mount where a directory has escaped is likely to be faster than a
similar lookup going down the tree with many disk seeks.
I don't think the 64bit counts and the zeroing the cache values are
quite as bad as you make out. There are much trickier things already in
path name lookup code. But I do agree that it is easy to get wrong
because nothing will show up in testing, and getting it wrong will have
really unpleasant effects.
I also can't see a scenario where a directory would escape a subtree
that is mounted somewhere without it being a misconfiguration.
So I agree it is not worth it to optimize the code so that there
are an absolute minimum number of d_ancestor calls during pathname
Further replacing mnt_escape_count with a mnt_flag makes the code much
simpler. Which I very much appreciate.
>> - The largest change is in d_unalias, where the two cases are split
>> apart so they can be handled separately. In the easy case of a
>> rename within the same directory all that is needed is __d_move
>> (escaping a mount is impossible in that case). In the more involved
>> case mutexes need to be acquired, and now the spin locks need to be
>> dropped so that proper lock aquisition order around __d_move can be
> I _really_ hate that part. Could you explain WTF is wrong with simply
> taking mount_lock in that case of __d_splice_alias() just outside of
Me too. So upon realizing the that inode->i_lock is held longer than
necessary in d_splice_alias I reworked the locking in d_splice_alias.
Updated patches to follow in a little bit.
> PS: that thing should be in fs/dcache.c, at least in the part that
> deals with finding the common ancestor, etc. And __d_move() (and
> dentry_lock_for_move()) games with d_ancestor() should be redundant
It does seem reasonable that the BUG_ONs in __d_move that call
d_ancestor can be removed, or simplified by passing the common ancestor
I don't know the code in dentry_lock_for_move well enough to say
anything except that the d_ancestor call in dentry_lock_for_move looks
Doing anything inside of __d_move or dentry_lock_for_move appears
to be a detour to the cause of preventing escaping from bind mounts.
So while I have no problems with the with the kinds of changes I hear
you suggesting, but unless I encounter something that makes changing
__d_move or dentry_lock_for_more relevant to the work of preventing
escaping from bind mounts I don't plan on touching them while that
is my focus.
More information about the Containers