[RFC PATCH v1 4/4] Allow to change the user namespace in which user rlimits are counted

Jann Horn jannh at google.com
Mon Nov 2 17:10:06 UTC 2020


On Mon, Nov 2, 2020 at 5:52 PM Alexey Gladkov <gladkov.alexey at gmail.com> wrote:
> Add a new prctl to change the user namespace in which the process
> counter is located. A pointer to the user namespace is in cred struct
> to be inherited by all child processes.
[...]
> +       case PR_SET_RLIMIT_USER_NAMESPACE:
> +               if (!capable(CAP_SYS_RESOURCE))
> +                       return -EPERM;
> +
> +               switch (arg2) {
> +               case PR_RLIMIT_BIND_GLOBAL_USERNS:
> +                       error = set_rlimit_ns(&init_user_ns);
> +                       break;
> +               case PR_RLIMIT_BIND_CURRENT_USERNS:
> +                       error = set_rlimit_ns(current_user_ns());
> +                       break;
> +               default:
> +                       error = -EINVAL;
> +               }
> +               break;

I don't see how this can work. capable() checks the capability against
init_user_ns, so in practice it only succeeds when
current_user_ns() == &init_user_ns. That means you can't use this API
to bind rlimits to any other user namespace.
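
To illustrate the point, here is a minimal userspace toy model of the
check, not kernel code. All toy_* names are hypothetical, and the model
deliberately ignores the rule that the owner of a namespace gets
capabilities over it. It only shows why a check pinned to the initial
namespace fails for a caller living in a child namespace, while a
namespace-relative check would pass.

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for the kernel's struct user_namespace (hypothetical). */
struct toy_userns {
    const struct toy_userns *parent; /* NULL for the initial namespace */
};

/* Toy stand-in for the caller's credentials (hypothetical). */
struct toy_cred {
    const struct toy_userns *user_ns; /* namespace the caller lives in */
    bool cap_sys_resource;            /* capability held in that namespace */
};

static const struct toy_userns toy_init_user_ns = { .parent = NULL };

/*
 * Models a namespace-relative check in the spirit of ns_capable():
 * a capability held in a namespace also applies to its descendants.
 * Assumption: the namespace-owner rule is left out for brevity.
 */
static bool toy_ns_capable(const struct toy_cred *cred,
                           const struct toy_userns *target)
{
    for (const struct toy_userns *ns = target; ns != NULL; ns = ns->parent)
        if (ns == cred->user_ns)
            return cred->cap_sys_resource;
    return false;
}

/* Models capable(): always checked against the initial namespace. */
static bool toy_capable(const struct toy_cred *cred)
{
    return toy_ns_capable(cred, &toy_init_user_ns);
}
```

With a child namespace whose parent is toy_init_user_ns, a caller that
holds the capability only in the child fails toy_capable() but passes
toy_ns_capable() against the child.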

Fundamentally, if it requires CAP_SYS_RESOURCE, this probably can't be
done as an API that a process uses to change its own rlimit scope. In
that case I would implement this as part of clone3() instead of
prctl(). (Then a task in init_user_ns can set it if it has
CAP_SYS_RESOURCE. If you want to have support for doing the same thing
with nested namespaces, you'd also need a flag that the first-level
clone3() can set on the namespace to say "further rlimit splitting
should be allowed".)

Or alternatively, we could say that CAP_SYS_RESOURCE doesn't matter,
and instead you're allowed to move the rlimit scope if your current
hard rlimit is RLIM_INFINITY. That might make more sense? Maybe?
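
For reference, the condition itself is easy to state. The sketch below
is a userspace approximation using getrlimit(); the real check would
live in the kernel and look at the task's own limits. The function
names are made up for illustration, and picking RLIMIT_NPROC as the
resource to test is an assumption based on the patch being about the
process counter.

```c
#include <stdbool.h>
#include <sys/resource.h>

/* Pure predicate: is the hard limit of this rlimit pair unlimited? */
static bool hard_limit_is_infinite(const struct rlimit *lim)
{
    return lim->rlim_max == RLIM_INFINITY;
}

/*
 * Hypothetical policy check: returns 1 if the calling process has an
 * unlimited hard limit for `resource`, 0 if the hard limit is finite,
 * and -1 if getrlimit() fails.
 */
static int may_move_rlimit_scope(int resource)
{
    struct rlimit lim;

    if (getrlimit(resource, &lim) != 0)
        return -1;
    return hard_limit_is_infinite(&lim) ? 1 : 0;
}
```

Calling may_move_rlimit_scope(RLIMIT_NPROC) would then tell you whether
the caller qualifies under this rule.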

