[PATCH ghak90 V9 11/13] audit: contid check descendancy and nesting

Fri Aug 7 17:10:25 UTC 2020

On 2020-07-05 11:11, Paul Moore wrote:
> On Sat, Jun 27, 2020 at 9:23 AM Richard Guy Briggs <rgb at redhat.com> wrote:
> > Require the target task to be a descendant of the container
> > orchestrator/engine.
> >
> > You would only change the audit container ID from one set or inherited
> > value to another if you were nesting containers.
> >
> > If changing the contid, the container orchestrator/engine must be a
> > descendant and not same orchestrator as the one that set it so it is not
> > possible to change the contid of another orchestrator's container.

Are we able to agree on the premises above?  Is anything asserted that
should not be and is there anything missing?

I've been sitting on my response below for more than a week trying to
understand the issues raised and to give it the proper attention to a
reply.  Please excuse my tardiness at replying on this issue since I'm
still having trouble thinking through all the scenarios for nesting.

> > Since the task_is_descendant() function is used in YAMA and in audit,
> > remove the duplication and pull the function into kernel/core/sched.c
> >
> > Signed-off-by: Richard Guy Briggs <rgb at redhat.com>
> > ---
> >  include/linux/sched.h    |  3 +++
> >  kernel/audit.c           | 23 +++++++++++++++++++++--
> >  kernel/sched/core.c      | 33 +++++++++++++++++++++++++++++++++
> >  security/yama/yama_lsm.c | 33 ---------------------------------
> >  4 files changed, 57 insertions(+), 35 deletions(-)
> >
> > diff --git a/include/linux/sched.h b/include/linux/sched.h
> > index 2213ac670386..06938d0b9e0c 100644
> > --- a/include/linux/sched.h
> > +++ b/include/linux/sched.h
> > @@ -2047,4 +2047,7 @@ static inline void rseq_syscall(struct pt_regs *regs)
> >
> >  const struct cpumask *sched_trace_rd_span(struct root_domain *rd);
> >
> > +extern int task_is_descendant(struct task_struct *parent,
> > +                             struct task_struct *child);
> > +
> >  #endif
> > diff --git a/kernel/audit.c b/kernel/audit.c
> > index a862721dfd9b..efa65ec01239 100644
> > --- a/kernel/audit.c
> > +++ b/kernel/audit.c
> > @@ -2713,6 +2713,20 @@ int audit_signal_info(int sig, struct task_struct *t)
> >         return audit_signal_info_syscall(t);
> >  }
> >
> > +static bool audit_contid_isnesting(struct task_struct *tsk)
> > +{
> > +       bool isowner = false;
> > +       bool ownerisparent = false;
> > +
> > +       rcu_read_lock();
> > +       if (tsk->audit && tsk->audit->cont) {
> > +               isowner = current == tsk->audit->cont->owner;
> > +               ownerisparent = task_is_descendant(tsk->audit->cont->owner, current);
> 
> I want to make sure I'm understanding this correctly and I keep
> mentally tripping over something: it seems like for a given audit
> container ID a task is either the owner or a descendent, there is no
> third state, is that correct?

Sure there is.  It could be another owner (which is addressed when we
search for an existing contobj match), or in the next patch, the
owner's parent if nested or a peer.

> Assuming that is true, can the descendent check simply be a negative
> owner check given they both have the same audit container ID?

There isn't actually a check in my code for the orchestrator contid and
task contid being the same.  Maybe I was making this check more
complicated than necessary, and still incomplete, but see below for more...

> > +       }
> > +       rcu_read_unlock();
> > +       return !isowner && ownerisparent;
> > +}
> > +
> >  /*
> >   * audit_set_contid - set current task's audit contid
> >   * @task: target task
> > @@ -2755,8 +2769,13 @@ int audit_set_contid(struct task_struct *task, u64 contid)
> >                 rc = -EBUSY;
> >                 goto unlock;
> >         }
> > -       /* if contid is already set, deny */
> > -       if (audit_contid_set(task))
> > +       /* if task is not descendant, block */
> > +       if (task == current || !task_is_descendant(current, task)) {
> 
> I'm also still fuzzy on why we can't let a task set it's own audit
> container ID, assuming it meets all the criteria established in patch
> 2/13.  It somewhat made sense when you were tracking inherited vs
> explicitly set audit container IDs, but that doesn't appear to be the
> case so far in this patchset, yes?

I'm still having a strong reluctance to permit this but can't come up
with a solid technical reason right now, but it feels like a layer
violation.  If we forbid it and discover it necessary and harmless, then
permitting it won't break the API.  If we permit it and later discover a
reason it causes a problem, then blocking it will break the API.  I have
heard that there are cases where there is no orchestrator/engine, so in
those cases I conclude that a process would need to set its own contid
but I'm having trouble recalling what those circumstances are.

I also was seriously considering blocking any contid set on the initial
user or PID namespaces to avoid polluting them, and even had a tested
patch to implement it, but this starts making assumptions about the
definition of a container with respect to namespaces which we have been
deliberately avoiding.

> > +               rc = -EXDEV;
> 
> I'm fairly confident we had a discussion about not using all these
> different error codes, but that may be a moot point given my next
> comment.

Yes, we did.  I reduced both circumstances down to what you requested,
shedding two along the way.  Given the number of different ways
orchestrators, contids and tasks can be related, I'd rather have more,
not fewer diagnostics to understand what it thinks is happenning.  This
is a realtively minor detail in the context of the rest of the
discussion in this thread.

> > +               goto unlock;
> > +       }
> > +       /* only allow contid setting again if nesting */
> > +       if (audit_contid_set(task) && !audit_contid_isnesting(task))
> >                 rc = -EEXIST;
> 
> It seems like what we need in audit_set_contid() is a check to ensure
> that the task being modified is only modified by the owner of the
> audit container ID, yes?  If so, I would think we could do this quite
> easily with the following, or similar logic, (NOTE: assumes both
> current and tsk are properly setup):
> 
>   if ((current->audit->cont != tsk->audit->cont) || (current->audit->cont->owner != current))
>     return -EACCESS;

Not necessarily.

If we start from the premise that once set, a contid on a task cannot be
unset, and then that it cannot be set to another value, then the oldest
ancestor in any container must not be able to change contid.  That
leaves any descendant (that hasn't threaded or parented) free to nest.

If we allow a task to modify its own contid (from the potential change
above), then if it inherited its contid, it could set its own.  This
still looks like a layer violation to me.  Going back to some
discussions with Eric Biederman from a number of years ago, it seems
wrong to me that a task should be able to see its own contid, let alone
be able to set it.  This came out of a CRIU concern about serial nsIDs
based on proc inode numbers not being portable.  Is it still a
consideration?

Another scenario comes to mind.  Should an orchestrator be able to set
the contid of a descendant of one of the former's child orchestrators?
This doesn't sound like a good idea leaping generations and I can't come
up with a valid use case.

> This is somewhat independent of the above issue, but we may also want
> to add to the capability check.  Patch 2 adds a
> "capable(CAP_AUDIT_CONTROL)" which is good, but perhaps we also need a
> "ns_capable(CAP_AUDIT_CONTROL)" to allow a given audit container ID
> orchestrator/owner the ability to control which of it's descendants
> can change their audit container ID, for example:
> 
>   if (!capable(CAP_AUDIT_CONTROL) ||
>       !ns_capable(current->nsproxy->user_ns, CAP_AUDIT_CONTROL))
>     return -EPERM;

Why does ns_capable keep being raised?  The last patch, capcontid, was
developed to solve this previously raised issue.  The issue was an
unprivileged user creating a user namespace with full capabilities,
circumventing capable() and being able to change the main audit
configuration.  It was already discussed in v8 and before that and my
last posting in the thread was left dangling with an unanswered
question:
https://lkml.org/lkml/2020/2/6/333

I only see this being potentially useful with audit namespaces in
conjunction with unprivileged user namespaces in the future with the
implementation of multiple audit daemons for the ability of an
unprivileged user to run their own distro container without influencing
the master audit configuration.

> paul moore

- RGB

--
Richard Guy Briggs <rgb at redhat.com>
Sr. S/W Engineer, Kernel Security, Base Operating Systems
Remote, Ottawa, Red Hat Canada
IRC: rgb, SunRaycer
Voice: +1.647.777.2635, Internal: (81) 32635