[REVIEW][PATCH 1/3] vfs: In d_path don't call d_dname on a mount point

Fri Jan 17 08:39:01 UTC 2014

On Thu, Jan 16, 2014 at 07:29:16PM -0800, Eric W. Biederman wrote:

> However it would be nice to see:
> 28 13 0:3 net:[4026532127] /var/run/netns/foo rw,nosuid,nodev,noexec,relatime - proc proc rw
> 
> Not having the path be the magical floating path is not causing
> regressions for people so I don't care much at the moment.

> >> Any dentry allocated with d_alloc_pseudo will have the same problem if
> >> it is ever in a context where it can be bind mounted.  So this is a
> >> general issue with d_name.
> >
> > Yes, ->d_dname() is called in a wrong place.  It should be in prepend_path
> > if not deeper.  What you are doing is growing a kludge on top of that
> > mistake.
> 
> I disagree. I am not growing a kludge on of a mistake.  I am refining
> the logic at the call site of d_dname.  The logic to not call d_dname
> on a mountpoint should be there wherever we call d_dname.

What, in your opinion does the word "kludge" mean?  The usual meaning is
"a change that might work, but depends on too many easy-to-break
accidental details of the current construction".  If you want to use
a different term, fine, but the problem doesn't disappear from renaming
things.

I agree that your patch doesn't break things and I've said so from the
very beginning.

[snip the obvious analysis]

> At a practical level there are improvements to be had by calling d_dname
> in dentry_path and by adding my avoidance of calling d_dname on a
> mountpoint into tomoyo_realpath_from_path.

... i.e. making things even more convoluted.  Congratulations, that's
just what we need.

> So while I see what you mean by prepend_path needing cleaning up that is
> really orthogonal to the acidental regression that I am fixing.
> 
> > As for redesign...  That's exactly what is needed in that area, instead of
> > letting it fester.
> 
> I can't backport a redesign to fix the regression, and a redesign
> results in no practical benefit for me.  So I might be persuaded later
> if I can find the time but right now a redesign is the wrong place to
> start.

Gotta love the style, but... I'm not persuaded by the above.  At all.

> > As for check_mnt(old) in do_loopback()...  I wonder if we shouldn't just
> > turn that puppy into (check_mnt(old) || (old->mnt.mnt_flags & MNT_INTERNAL)).
> > Note that MS_NOUSER in superblock flags would prevent binding pipefs and
> > all mount_pseudo() users, so we wouldn't change existing behaviour...
> 
> Strictly speaking behavior would change for the proc namespace files, as
> now you could do things by finding a mount of proc in another mount
> namespace where the namespace files were available.

Er...  Yes, and?  If you can do that, you can bloody well just open them
there and then do exact same thing.

The visible change would be different - it would be the ability to bind
hugetlbfs and shmem anon mappings via /proc/*/map_files/*

> I would be a lot more comfortable with changing do_loopback if we could
> safely remove the check_mnt(old) test altogether.  Why do we have the
> check_mnt(old) test in do_loopback?   If we can't remove the
> check_mnt(old) test I need verify that the reasons for that test don't
> apply to my namespace file descriptor case.  Otherwise I can't see any
> real problems with making that change (time permitting).

Sigh...  You do realize that right now your proc_ns_follow_link() violates
a pretty basic assert: nd->path.mnt->mnt_sb == nd->path.dentry->d_sb,
right?  IOW, that dentry belongs to the filesystem pointed to be vfsmount.

At ->follow_link() time we have two objects to keep track of: where we are
in pathname resolution (i.e. which directory had we reached) and what
symlink are we resolving.  nd->path points to the former; the first
argument of ->follow_link() - to the latter.  _Usually_ the parent is
on the same vfsmount as the symlink, so your shortcut doesn't blow up
all the time.  However, that is not guaranteed - it *is* possible to
bind a symlink.  Not conveniently at the moment, but it can be done and
there's no fundamental reasons why it should be forbidden.  And in that
setup proc_ns_follow_link() is in trouble.

The reasons for check_mnt() there are, IIRC, more about playing silly buggers
with propagation trees from some other namespace.  IOW, they do not apply
to cloning stuff off the internal vfsmounts.

/proc/*/map_files/* is potentially a problem, though - being able to
create a binding to something on hugetlbfs might have interesting implications.