[CFT][PATCH 2/3] userns: Add a knob to disable setgroups on a per user namespace basis

Tue Dec 2 22:17:17 UTC 2014

On Tue, Dec 2, 2014 at 1:45 PM, Eric W. Biederman <ebiederm at xmission.com> wrote:
> Andy Lutomirski <luto at amacapital.net> writes:
>
>> On Tue, Dec 2, 2014 at 12:28 PM, Eric W. Biederman
>> <ebiederm at xmission.com> wrote:
>>>
>>> - Expose the knob to user space through a proc file /proc/<pid>/setgroups
>>
>> Can you rename this to something clearer, e.g. userns_setgroups_mode?
>
> I am not certain that is any clearer.  That just reads as wordier.
>
> The userns bit is definitely confusing and wrong.  Why should we need to
> spell out the scope?

Because it's a property of the process' userns, not a property of the process.

userns_setgroups would be okay.  (Arguably it should've been
userns_uid_map, too, but at least uid_map sounds obviously
namespace-related.)

>
>>>   A value of 0 means the setgroups system call is disabled in the
>>>   current processes user namespace and can not be enabled in the
>>>   future in this user namespace.
>>>
>>>   A value of 1 means the segtoups system call is enabled.
>>
>> Would it make more sense to put strings like "allow" and "deny" in
>> here?  That way, future extensions could add additional values.
>
> If the implementation of the write side isn't too bad.  I would love
> to see precedent elsewhere in the kernel.    What I don't expect to do
> is have any values except setgroups are enabled and setgroups are
> disabled.

current_clocksource?  I think there are lots of things like that.

>
> I am fine with allowing for the possibility (that is just good
> engineering) but I don't intend to seriously considering or
> implementing other possibilities.

I can imagine a mode where there's a fixed set of groups that are
forced set but other groups can be added and dropped.  It would be
complicated to set up right, but someone might want it some day.

>
>>> diff --git a/arch/s390/kernel/compat_linux.c b/arch/s390/kernel/compat_linux.c
>>> index 21c91feeca2d..6d0ee1b089fb 100644
>>> --- a/arch/s390/kernel/compat_linux.c
>>> +++ b/arch/s390/kernel/compat_linux.c
>>> @@ -252,6 +252,7 @@ COMPAT_SYSCALL_DEFINE2(s390_setgroups16, int, gidsetsize, u16 __user *, grouplis
>>>         int retval;
>>>
>>>         if (!gid_mapping_possible(user_ns) ||
>>> +           !atomic_read(&user_ns->setgroups_allowed) ||
>>>             !capable(CAP_SETGID))
>>>                 return -EPERM;
>>
>> This is now incomprehensible because of the gid_mapping_possible
>> thing.  If you renamed gid_mapping_possible to
>> userns_setgroup_allowed, then this could be added to the
>> implementation, and this would all make sense (not to mention avoiding
>> duplicating this thing).
>>
>>> @@ -826,6 +827,11 @@ static bool new_idmap_permitted(const struct file *file,
>>>                         kuid_t uid = make_kuid(ns->parent, id);
>>>                         if (uid_eq(uid, cred->euid))
>>>                                 return true;
>>> +               } else if (cap_setid == CAP_SETGID) {
>>> +                       kgid_t gid = make_kgid(ns->parent, id);
>>> +                       if (!atomic_read(&ns->setgroups_allowed) &&
>>> +                           gid_eq(gid, cred->egid))
>>> +                               return true;
>>
>> I still don't see why egid is any better than fsgid here.
>
> Answered in my earlier response fsgid was a goof.
> I can set any gid to my egid with my existing permissions.
> Show me how I can do that with fsgid or fsuid and I will be happy to use
> those.

You can use your fsgid to make a setgid file, I think.  But yes, point
taken, although as mentioned in the other thread, I think it would be
a lot clearer if it were a separate patch.

>>> +
>>> +ssize_t proc_setgroups_write(struct file *file, const char __user *buf,
>>> +                            size_t count, loff_t *ppos)
>>> +{
>>> +       struct seq_file *seq = file->private_data;
>>> +       struct user_namespace *ns = seq->private;
>>> +       char kbuf[3];
>>> +       int setgroups_allowed;
>>> +       ssize_t ret;
>>> +
>>> +       ret = -EPERM;
>>> +       if (!file_ns_capable(file, ns, CAP_SETGID))
>>> +               goto out;
>>
>> CAP_SYS_ADMIN?  This isn't setting a gid in the namespace; it's
>> reconfiguring the namespace.
>
> Hmm.  Maybe.  It is an activity that is normally controlled by
> CAP_SETGID.
>
> Frankly I think the entire split up of the capability model is almost
> totally broken.  But I think CAP_SETGID is a close approximation of the
> right thing here.

I agree that the cap model is screwy.  But we use CAP_SYS_ADMIN for
everything else that changes the overall behavior of a namespace.

In any event, the only way it matters is for a non-ns owner in the
parent ns with CAP_SETGID set but not CAP_SYS_ADMIN.  I'd argue that
CAP_SETGID should not be usable to make unrelated process' syscalls
fail.

>
>>> +       /* Only allow a very narrow range of strings to be written */
>>> +       ret = -EINVAL;
>>> +       if ((*ppos != 0) || (count >= sizeof(kbuf)) || (count < 1))
>>> +               goto out;
>>> +
>>> +       /* What was written? */
>>> +       ret = -EFAULT;
>>> +       if (copy_from_user(kbuf, buf, count))
>>> +               goto out;
>>> +       kbuf[count] = '\0';
>>> +
>>> +       /* What is being requested? */
>>> +       ret = -EINVAL;
>>> +       if (kbuf[0] == '0')
>>> +               setgroups_allowed = 0;
>>> +       else if (kbuf[0] == '1')
>>> +               setgroups_allowed = 1;
>>> +       else
>>> +               goto out;
>>> +
>>> +       /* Allow a trailing newline */
>>> +       ret = -EINVAL;
>>> +       if ((count == 2) && (kbuf[1] != '\n'))
>>> +               goto out;
>>> +
>>> +
>>> +       if (setgroups_allowed) {
>>> +               ret = -EINVAL;
>>> +               if (atomic_read(&ns->setgroups_allowed) == 0)
>>> +                       goto out;
>>> +       } else {
>>
>> I would disallow this if gid_map has been written in the interest of
>> sanity.
>
> Not a chance.  That is part of making this an independent knob.  If
> there is another reason for disabling setgroups you can flip this
> knob even after mappings are established.

Then you really want CAP_SYS_ADMIN, I think.

>
>>> +               atomic_set(&ns->setgroups_allowed, 0);
>>> +               /* sigh memory barriers! */
>>
>> I don't think that any barriers are needed.  If you ever observe
>> setgroups_allowed == 0, it will stay that way forever.
>
> Likely.   In practice the code works today.
>
> But I need to review things closely to understand if there are barriers
> needed.  But especially since it is a write once knob we can get away
> with a lot.
>

Yeah.

For long-term use, I kind of like the flags approach I added in the
other patch.  It makes it easy to add more flags in the future.

In any event, I think the only barrier that's needed is when writing gid_map:

atomic_read / test_bit to determine whether unpriv mappings are okay;
smp_mb() or whatever the current _after_atomic thing is these days;
write mapping;

Although I'm not sure whether Linux supports any architectures that
can violate causality in the way that barrier is there to prevent.

--Andy