No subject
Wed Apr 30 13:12:50 PDT 2008
To do that we don't need uid-mapping in the kernel; all we need
is a concept of user_structs owning the user_namespaces they created.
Then the default vfs_permission becomes:
	1. if (user->uid == inode->uid && user->uidns == vfsmount->userns)
		treat as owner
	2. if (user->uidns == vfsmount->userns)
		check groups
	3. if (vfsmount->userns is in user->child_namespaces)
		treat as uid 0
	4. treat as nobody
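
Roughly, the four steps above could be sketched in C as follows. This is
only a userspace illustration, not kernel code: the stub structs, the
fixed-size child_namespaces array, the perm_class enum and the helper
names are all made up for the example, and step 2 just reports that the
ordinary group/other checks should run.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel objects named in the mail. */
struct user_namespace { int id; };

#define MAX_CHILD_NS 8

struct user_struct {
	unsigned int uid;
	const struct user_namespace *uidns;	/* namespace the user lives in */
	const struct user_namespace *child_namespaces[MAX_CHILD_NS];
	size_t nr_children;			/* namespaces this user created */
};

struct inode_stub { unsigned int uid; };
struct vfsmount_stub { const struct user_namespace *userns; };

enum perm_class { PERM_OWNER, PERM_GROUP_CHECK, PERM_ROOT, PERM_NOBODY };

/* True if 'user' created (and therefore owns) namespace 'ns'. */
static bool owns_namespace(const struct user_struct *user,
			   const struct user_namespace *ns)
{
	for (size_t i = 0; i < user->nr_children; i++)
		if (user->child_namespaces[i] == ns)
			return true;
	return false;
}

/* Classify an access following steps 1-4 of the proposed default. */
enum perm_class classify_access(const struct user_struct *user,
				const struct inode_stub *inode,
				const struct vfsmount_stub *mnt)
{
	assert(user && inode && mnt);

	if (user->uidns == mnt->userns) {
		if (user->uid == inode->uid)
			return PERM_OWNER;	/* step 1 */
		return PERM_GROUP_CHECK;	/* step 2 */
	}
	if (owns_namespace(user, mnt->userns))
		return PERM_ROOT;		/* step 3: treat as uid 0 */
	return PERM_NOBODY;			/* step 4 */
}
```

With this, a user 500:1 who created userns 2 would be classified as
PERM_ROOT on a mount tagged with userns 2, while some other X:1 who did
not create it would fall through to PERM_NOBODY.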
But as I was laying that out and trying to define sane semantics,
the following obvious shortcoming sprang out. It would have
been obvious if I'd given some fs semantics requirements right
at the top:
Let's use X:Y to describe uid X in userns Y. Let's assume
the behavior described above, and that we tag vfsmounts with
the user_namespace of the user_struct whose task performed
the mount.
When user 500:1 creates a container with uidns2, wherein he
uses uids 0:2 and 400:2, then:
	1. files belonging to 500:1 should be treated no
	   differently than files belonging to any other X:1.
	   The container init can mount --bind its / early
	   on using user nobody permissions, so this is sufficient.
	2. files created by 0:2 should be owned by 0:2 in
	   the container.
BUT
	3. files created by 0:2 should not be owned by uid 0
	   in the parent container (0:1).
	4. when a task executes a file owned by 0:2 or a
	   file owned by X:2 carrying file capabilities, the
	   resulting task should carry privilege over objects
	   in userns 2, not over objects in userns 1.
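
To make the conflict with requirement 3 concrete, here is a small C
sketch. It assumes the inode stores only a bare uid and that the
namespace half of X:Y comes solely from the vfsmount tag; the struct and
function names are invented for the illustration and are not kernel APIs.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins, as in the mail's X:Y notation. */
struct user_namespace { int id; };
struct inode_stub { unsigned int uid; };	/* inode stores a bare uid only */
struct vfsmount_stub { const struct user_namespace *userns; };

/* X:Y, i.e. uid X in userns Y. */
struct qualified_uid {
	unsigned int uid;
	const struct user_namespace *ns;
};

/* With vfsmount tagging only, Y is taken from the mount, not the inode. */
struct qualified_uid owner_via_mount(const struct inode_stub *inode,
				     const struct vfsmount_stub *mnt)
{
	assert(inode && mnt);
	struct qualified_uid q = { inode->uid, mnt->userns };
	return q;
}

bool same_owner(struct qualified_uid a, struct qualified_uid b)
{
	return a.uid == b.uid && a.ns == b.ns;
}
```

Under those assumptions, a file that 0:2 creates is stored with uid 0.
Seen through the container's mount it is owned by 0:2, as requirement 2
wants, but the same inode seen through a mount tagged with userns 1
comes out as 0:1, which is what requirement 3 forbids.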
So simple tagging of vfsmounts can only suffice if we insist
on tagging newly created files with a uid in the initial
user_namespace. Then to do anything fancier - really, to do
anything sufficient for system containers - we'd have to
use one of the other things we described above - nfsv4, or
a mostly-pass-through stackable fs, or whatever.
-serge