[PATCH v7 5/5] namei: resolveat(2) syscall

Aleksa Sarai cyphar at cyphar.com
Sun May 26 05:46:38 UTC 2019


On 2019-05-25, Linus Torvalds <torvalds at linux-foundation.org> wrote:
> In fact, I think resolveat() as a model is fundamentally wrong for yet
> another reason: O_CREAT. If you want to _create_ a new file, and you
> want to still have the path resolution modifiers in place, the
> resolveat() model is broken, because it only gives you path resolution
> for the lookup, and then when you do openat(O_CREAT) for the final
> component, you now don't have any way to limit that last component.
> 
> Sure,  you can probably effectively hack around it with resolveat() on
> everything but the last component, and then
> openat(O_CREAT|O_EXCL|O_NOFOLLOW) on the last component, because the
> O_EXCL|O_NOFOLLOW should mean that it won't do any of the unsafe
> things. And then (if you didn't actually want the O_EXCL), you handle
> the race between "somebody else got there first" by re-trying etc. So
> I suspect the O_CREAT thing could be worked around with extra
> complexity, but it's an example of how the O_PATH model really screws
> you over.
> 
> End result: I really think resolveat() is broken. It absolutely
> *needs* to be a full-fledged "openat()" system call, just with added
> path resolution flags.

That's a very good point. I was starting to work on O_CREAT via
resolveat(2) and it definitely was much harder than most people would be
happy to deal with -- I'm not even sure that I handled all the cases.

I'll go for an openat2(2) then. Thinking about it some more -- since
it's a new syscall, I could actually implement the O_PATH link-mode as
being just the regular mode argument (since openat(2) ignores the mode
for non-O_CREAT anyway). The openat(2) wrapper might be more than
one-line as a result but it should avoid polluting the resolution flags
with mode flags (since openat(O_PATH) needs to have a g+rwx mode for
backwards-compatibility).

> >   openat2(rootfd, "path/to/file", O_PATH, RESOLVE_IN_ROOT, &statbuf);
> 
> Note that for O_CREAT, it either needs the 'mode' parameter too, or
> the statbuf would need to be an in-out thing. I think the in-out model
> might be nice (the varargs model with a conditional 'mode' parameter
> is horrid, I think), but at some point it's just bike-shedding.
> 
> Also, I'm not absolutely married to the statbuf, but I do think it
> might be a useful extension. A *lot* of users need the size of the
> file for subsequent mmap() calls (or for buffer management for
> read/write interface) or for things like just headers (ie
> "Content-length:" in html etc).
> 
> I'm not sure people actually want a full 'struct stat', but people
> historically also do st_ino/st_dev for indexing into existing
> user-space caches (or to check permissions like that it's on the right
> filesystem, but the resolve flags kind of make that less of an issue).
> And st_mode to verify that it's a proper regular file etc etc.
> 
> You can always just leave it as NULL if you don't care, there's almost
> no downside to adding it, and I do think that people who want a "walk
> pathname carefully" model almost always would want to also check the
> end result anyway.

Yeah, I agree -- most folks would want to double-check what they've
opened or O_PATH'd. Though I'm still not clear what is the best way of
doing the "stat" argument is -- especially given how much fun
architecture-specific shenanigans are in fs/stat.c (should I only use
cp_new_stat64 or have a separate 64-bit syscall).

> > Is there a large amount of overhead or downside to the current
> > open->fstat way of doing things, or is this more of a "if we're going to
> > add more ways of opening we might as well add more useful things"?
> 
> Right now, system calls are sadly very expensive on a lot of hardware.
> We used to be very proud of the fact that Linux system calls were
> fast, but with meltdown and retpoline etc, we're back to "system calls
> can be several thousand cycles each, just in overhead, on commonly
> available hardware".
> 
> Is "several thousand cycles" fatal? Not necessarily. But I think that
> if we do add a new system call, particularly a fancy one for special
> security-conscious models, we should look at what people need and use,
> and want. And performance should always be a concern.
> 
> I realize that people who come at this from primarily just a security
> issue background may think that security is the primary goal. But no.
> Security always needs to realize that the _primary_ goal is to have
> people _use_ it. Without users, security is entirely pointless. And
> the users part is partly performance, but mostly "it's convenient".

Yup, I agree. I was hoping to shunt most of the convenience to userspace
to avoid ruffling feathers in VFS-land, but I'm more than happy to make
the kernel ABI more convenient.

> The whole "this is Linux-specific" is a big inconvenience point

I hope that other OSes will take our lead and have a similar interface
so that the particular inconvenience can go away eventually (this was
one of the arguments for resolveat(2) -- it's a clear "this is a new
idea" interface rather than mixing it with other O_* flags).

> Talking about securely opening things - another flag that we may want
> to avoid issues is a "don't open device nodes" flag. Sure, O_NONBLOCK
> plus checking the st_mode of the result is usually sufficient, but
> it's actually fairly easy to get that wrong. Things like /dev/tty and
> /dev/zero often need to be available for various reasons, and have
> been used to screw careless "open and read" users up that missed a
> check.

Sure, this could be added -- though I'm sure folks would have
disagreements over whether it should be a resolution flag on an open
flag.

> I also do wonder that if the only actual user-facing interface for the
> resolution flags is a new system call, should we not make the
> *default* value be "don't open anything odd at all".
>
> So instead of saying RESOLVE_XDEV for "don't cross mount points",
> maybe the flags should be the other way around, and say "yes, allow
> mount point crossings", and "yes, explicitly allow device node
> opening", and "yes, allow DOTDOT" etc.

This seems like a reasonable default, except for the cases of
RESOLVE_BENEATH and RESOLVE_IN_ROOT (that would make using AT_FDCWD more
cumbersome than necessary). But other than that, I'd be happy to invert
all the other flags.

-- 
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
<https://www.cyphar.com/>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 833 bytes
Desc: not available
URL: <http://lists.linuxfoundation.org/pipermail/containers/attachments/20190526/b5668779/attachment.sig>


More information about the Containers mailing list