RFC: Audit Kernel Container IDs

Wed Sep 13 19:33:52 UTC 2017

On 09/13/2017 12:13 PM, Richard Guy Briggs wrote:
> Containers are a userspace concept.  The kernel knows nothing of them.

I am looking at this RFC from a userspace perspective, particularly from
the loader's point of view, and with an eye to the unshare syscall and
the semantics that arise from its use.

At a high level what you are doing is providing a way to group, without
hierarchy, processes and namespaces. The processes can move between
containers if they have CAP_CONTAINER_ADMIN and can open and write to
a special proc file.

* With unshare a thread may dissociate part of its execution context and
  therefore see a distinct mount namespace. When you say "process" in this
  particular RFC do you exclude the fact that a thread might be in a
  distinct container from the rest of the threads in the process?
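
  To make the question concrete: a single thread can already sit in a
  different mount namespace than its siblings, and userspace can observe
  that. A minimal sketch, assuming the usual /proc/<pid>/task/<tid>/ns/
  layout; the helper name threads_by_mnt_ns and the proc_root parameter
  are mine, for illustration only:

```python
import os
from collections import defaultdict


def threads_by_mnt_ns(pid, proc_root="/proc"):
    """Group the thread IDs of `pid` by the mount namespace each is in.

    A namespace is identified by the (st_dev, st_ino) pair obtained by
    stat()ing the ns/mnt entry under <proc_root>/<pid>/task/<tid>/.
    """
    groups = defaultdict(list)
    task_dir = os.path.join(proc_root, str(pid), "task")
    for tid in sorted(os.listdir(task_dir), key=int):
        try:
            st = os.stat(os.path.join(task_dir, tid, "ns", "mnt"))
        except FileNotFoundError:
            # The thread exited between listdir() and stat(); skip it.
            continue
        groups[(st.st_dev, st.st_ino)].append(int(tid))
    return dict(groups)
```

  Calling threads_by_mnt_ns(os.getpid()) on a process where one thread
  has called unshare(CLONE_NEWNS) should return more than one group. In
  that case "the container of the process" is ambiguous unless the
  container ID is tracked per thread.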

> The Linux audit system needs a way to be able to track the container
> provenance of events and actions.  Audit needs the kernel's help to do
> this.

* Why does the Linux audit system need to track container provenance?

  - How does it help to provide better audit messages?

  - Would it be enough to list the namespaces that a process occupies?

* Why does it need the kernel's help?

  - Is there a race condition that is only fixable with kernel support?

  - Or is it easier with kernel help but not required?

Providing background on these questions would help clarify the
design requirements.
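
For reference, the namespace listing I have in mind is something
userspace can already produce without new kernel support. A minimal
sketch, assuming the /proc/<pid>/ns/ entries can be stat()ed by the
caller; the function name namespace_ids and the proc_root parameter are
illustrative, not an existing API:

```python
import os


def namespace_ids(pid="self", proc_root="/proc"):
    """Return {namespace name: (st_dev, st_ino)} for one process.

    The (device, inode) pair is the same kind of tuple the RFC proposes
    to record for each namespace.
    """
    ns_dir = os.path.join(proc_root, str(pid), "ns")
    ids = {}
    for name in sorted(os.listdir(ns_dir)):
        try:
            st = os.stat(os.path.join(ns_dir, name))
        except OSError:
            # Entry not resolvable for this process; leave it out.
            continue
        ids[name] = (st.st_dev, st.st_ino)
    return ids
```

If a listing like this, taken at the time of the event, is not enough
for audit, it would help to see the case where it falls short.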

> Since the concept of a container is entirely a userspace concept, a
> trigger signal from the userspace container orchestration system
> initiates this.  This will define a point in time and a set of resources
> associated with a particular container with an audit container ID.

Please don't use the word 'signal'; I suggest 'register', since you are
writing to a filesystem.

> The trigger is a pseudo filesystem (proc, since PID tree already exists)
> write of a u64 representing the container ID to a file representing a
> process that will become the first process in a new container.
> This might place restrictions on mount namespaces required to define a
> container, or at least careful checking of namespaces in the kernel to
> verify permissions of the orchestrator so it can't change its own
> container ID.
> A bind mount of nsfs may be necessary in the container orchestrator's
> mntNS.
> 
> Require a new CAP_CONTAINER_ADMIN to be able to write to the pseudo
> filesystem to have this action permitted.  At that time, record the
> child container's user-supplied 64-bit container identifier along with

What is a "child container?" Containers don't have any hierarchy.

I assume that if you don't have CAP_CONTAINER_ADMIN, nothing prevents
your continued operation as it works today?
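
From the orchestrator's side I would expect the registration to look
roughly like the sketch below. Everything specific here is an
assumption on my part: the RFC does not name the proc file, so
"containerid" is a placeholder, and writing the u64 as decimal ASCII is
a guess at the format. A refused write is treated as non-fatal, which
is the behaviour I am asking about:

```python
import os

U64_MAX = 2**64 - 1


def register_container_id(pid, container_id,
                          proc_root="/proc", filename="containerid"):
    """Try to register `container_id` for `pid`.

    Returns True if the write succeeded and False if it was refused
    (e.g. the caller lacks the proposed CAP_CONTAINER_ADMIN).
    `filename` is a hypothetical name; the RFC does not specify one.
    """
    if not 0 <= container_id <= U64_MAX:
        raise ValueError("container ID must fit in an unsigned 64-bit integer")
    path = os.path.join(proc_root, str(pid), filename)
    try:
        with open(path, "w") as f:
            f.write(str(container_id))
    except PermissionError:
        # Not permitted: carry on unregistered, as processes do today.
        return False
    return True
```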

> the child container's first process (which may become the container's
> "init" process) process ID (referenced from the initial PID namespace),
> all namespace IDs (in the form of a nsfs device number and inode number
> tuple) in a new auxilliary record AUDIT_CONTAINER with a qualifying
> op=$action field.

What kind of requirement is there on the first tid/pid registering
the container ID? What if the 8th tid/pid does the registration?
Would that mean that the first process of the container did not
register? It seems like you are suggesting that the registration
by the 8th tid/pid causes a cascading registration process,
registering all tid/pids in the same grouping? Is that true?
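
The two readings I can see, written as a toy userspace model (this is
not kernel code; the Task class, the register function, and treating
"the same grouping" as "the same thread group" are all my assumptions):

```python
class Task:
    """Toy stand-in for a kernel task: a tid inside a thread group."""

    def __init__(self, tid, tgid):
        self.tid = tid
        self.tgid = tgid
        self.container_id = None  # unset until registered


def register(tasks, tid, container_id, cascade):
    """Register `container_id` on the task `tid`.

    cascade=False: only the task that was written to is registered.
    cascade=True:  every task in the same thread group is registered.
    """
    target = next(t for t in tasks if t.tid == tid)
    for t in tasks:
        if t is target or (cascade and t.tgid == target.tgid):
            t.container_id = container_id
```

With eight tasks in one thread group, register(tasks, 8, 42,
cascade=False) leaves tids 1 through 7 unregistered, while
cascade=True registers all eight. Which of these does the RFC intend?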

> Issue a new auxilliary record AUDIT_CONTAINER_INFO for each valid
> container ID present on an auditable action or event.
> 
> Forked and cloned processes inherit their parent's container ID,
> referenced in the process' audit_context struct.

So a cloned process with CLONE_NEWNS has the same container ID
as the parent process that called clone, at least until the clone
has time to change to a new container ID?

Do you foresee any case where someone might need a semantic that is
slightly different? For example wanting to set the container ID on
clone?
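
A toy model of the difference, again purely illustrative (the names
Task, clone and audit_event are mine, and the container_id argument to
clone is a hypothetical extension, not something the RFC proposes):

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Task:
    pid: int
    container_id: Optional[int] = None


def clone(parent, child_pid, container_id=None):
    """Model clone(): inherit the parent's container ID unless one is
    supplied at clone time (the hypothetical set-on-clone semantic)."""
    cid = parent.container_id if container_id is None else container_id
    return Task(child_pid, cid)


def audit_event(log, task, what):
    """Record an event tagged with the task's current container ID."""
    log.append((task.pid, task.container_id, what))


log = []
orchestrator = Task(pid=1, container_id=10)

# Inherit-then-reassign: events in the window carry the parent's ID.
child = clone(orchestrator, 2)
audit_event(log, child, "early event")
child.container_id = 20            # models the later proc write
audit_event(log, child, "later event")

# Set-on-clone: no window; the first event already has the new ID.
child2 = clone(orchestrator, 3, container_id=20)
audit_event(log, child2, "early event")
```

In the first case the child's early event is attributed to container
10, in the second to container 20. If audit cares about that window,
set-on-clone may be worth considering.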

> Log the creation of every namespace, inheriting/adding its spawning
> process' containerID(s), if applicable.  Include the spawning and
> spawned namespace IDs (device and inode number tuples).
> [AUDIT_NS_CREATE, AUDIT_NS_DESTROY] [clone(2), unshare(2), setns(2)]
> Note: At this point it appears only network namespaces may need to track
> container IDs apart from processes since incoming packets may cause an
> auditable event before being associated with a process.

OK.

> Log the destruction of every namespace when it is no longer used by any
> process, include the namespace IDs (device and inode number tuples).
> [AUDIT_NS_DESTROY] [process exit, unshare(2), setns(2)]
> 
> Issue a new auxilliary record AUDIT_NS_CHANGE listing (opt: op=$action)
> the parent and child namespace IDs for any changes to a process'
> namespaces. [setns(2)]
> Note: It may be possible to combine AUDIT_NS_* record formats and
> distinguish them with an op=$action field depending on the fields
> required for each message type.
> 
> A process can be moved from one container to another by using the
> container assignment method outlined above a second time.

OK.

> When a container ceases to exist because the last process in that
> container has exited and hence the last namespace has been destroyed and
> its refcount dropping to zero, log the fact.
> (This latter is likely needed for certification accountability.)  A
> container object may need a list of processes and/or namespaces.

OK.

> A namespace cannot directly migrate from one container to another but
> could be assigned to a newly spawned container.  A namespace can be
> moved from one container to another indirectly by having that namespace
> used in a second process in another container and then ending all the
> processes in the first container.

OK.

> Feedback please.

-- 
Cheers,
Carlos.