RFC: making cn_proc work in {pid,user} namespaces

Aleksa Sarai asarai at suse.de
Sun Oct 15 10:05:49 UTC 2017


Hi all,

At the moment, cn_proc is not usable by containers or container 
runtimes. In addition, all connectors have an odd relationship with 
init_net (for example, /proc/net/connectors only exists in init_net). 
There are two main use-cases for which cn_proc would be a perfect fit,
which is why I'm pushing this:

First, when adding a process to an existing container, in certain modes 
runc would like to know that process's exit code. But, when joining a 
PID namespace, it is advisable[1] to always double-fork after doing the 
setns(2) to reparent the joining process to the init of the container 
(this causes the SIGCHLD to be received by the container init). It would 
also be useful to be able to monitor the exit code of the init process 
in a container without being its parent. At the moment, cn_proc cannot
be used by unprivileged users (making it a problem for user namespaces
and "rootless containers"). It also cannot be used by nested
containers, because it requires the process to be in init_pid. As a
result, runc cannot use cn_proc and relies on SIGCHLD (which only works
if we don't double-fork, or if we keep a long-running process around,
which is something that runc also cannot do).
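
To make the SIGCHLD limitation concrete, here is a minimal sketch of the
double-fork pattern in Python. It is only an illustration: the setns(2)
step is left out, the function name is made up, and nothing here is
runc's actual code. It shows that the original caller can reap the
intermediate process but has no way to wait on the grandchild.

```python
import os


def double_fork_demo(grandchild_status: int = 7):
    """Return (exit code of the intermediate child,
    whether the caller was able to wait on the grandchild)."""
    r, w = os.pipe()
    child = os.fork()
    if child == 0:
        # Intermediate process. A runtime would already have called
        # setns(2) at this point; that step is omitted in this sketch.
        try:
            os.close(r)
            grandchild = os.fork()
            if grandchild == 0:
                os.close(w)
                # The process whose exit code we actually care about.
                os._exit(grandchild_status)
            # Report the grandchild's pid to the caller, then exit so
            # that the grandchild gets reparented.
            os.write(w, str(grandchild).encode())
            os._exit(0)
        finally:
            os._exit(1)  # only reached if something above raised

    os.close(w)
    grandchild_pid = int(os.read(r, 32).decode())
    os.close(r)

    _, status = os.waitpid(child, 0)
    child_code = os.WEXITSTATUS(status)

    try:
        os.waitpid(grandchild_pid, 0)
        waitable = True
    except ChildProcessError:
        # ECHILD: the grandchild is not our child, so its exit code is
        # delivered to whoever it was reparented to, not to us.
        waitable = False
    return child_code, waitable
```

As long as the caller is neither PID 1 nor a child subreaper, this
returns (0, False): the exit status 7 never reaches the caller, which is
the gap cn_proc could fill.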

Secondly, there are/were some init systems that rely on cn_proc to 
manage service state. From a "it would be neat" perspective, I think it 
would be quite nice if such init systems could be used inside 
containers. But that requires cn_proc to be able to be used as an 
unprivileged user and in a pid namespace other than init_pid.
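
For reference, this is roughly what subscribing to cn_proc looks like
from userspace today, sketched in Python. The constants are copied from
the kernel uapi headers (linux/netlink.h, linux/connector.h,
linux/cn_proc.h); the function names are my own, and subscribe() is
untested here since, as described above, it is only expected to work for
a privileged process in the initial namespaces.

```python
import os
import socket
import struct

NETLINK_CONNECTOR = 11    # linux/netlink.h
NLMSG_DONE = 3            # linux/netlink.h
CN_IDX_PROC = 1           # linux/connector.h
CN_VAL_PROC = 1           # linux/connector.h
PROC_CN_MCAST_LISTEN = 1  # linux/cn_proc.h


def build_listen_request(port_id: int) -> bytes:
    """Build the netlink message that asks cn_proc to start sending events."""
    # Payload: enum proc_cn_mcast_op.
    op = struct.pack("=I", PROC_CN_MCAST_LISTEN)
    # struct cn_msg: cb_id {idx, val}, seq, ack, len, flags, then data.
    cn_msg = struct.pack(
        "=IIIIHH", CN_IDX_PROC, CN_VAL_PROC, 0, 0, len(op), 0
    ) + op
    # struct nlmsghdr: len, type, flags, seq, pid (16 bytes).
    total_len = 16 + len(cn_msg)
    header = struct.pack("=IHHII", total_len, NLMSG_DONE, 0, 0, port_id)
    return header + cn_msg


def subscribe() -> socket.socket:
    """Open a connector socket and send the listen request (Linux only)."""
    sock = socket.socket(socket.AF_NETLINK, socket.SOCK_DGRAM, NETLINK_CONNECTOR)
    # The netlink address is (port id, multicast group mask).
    sock.bind((os.getpid(), CN_IDX_PROC))
    sock.send(build_listen_request(os.getpid()))
    return sock
```

An init system inside a container would need this sequence to succeed
without privileges in the initial namespaces, which is exactly what does
not work at the moment.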

The /proc/net/connectors thing is quite easily resolved (just make the
connector driver perdev and make some small changes to make sure the
interfaces stay sane inside of a container's network namespace). We'll
probably also have to make some changes to the registration API, so
that a connector can specify whether it wants to be visible to
non-init_net namespaces.

However, the cn_proc problem is a bit harder to resolve nicely and there 
are quite a few interface questions that would need to be agreed upon. 
The basic idea would be that a process can only get cn_proc events about
a given process if it has ptrace_may_access rights over that process
(effectively a forced filter, which would ideally be done send-side but
looks like it might have to be done receive-side). This should resolve
possible concerns about an unprivileged process being able to inspect
(fairly granular) information about the host. The pids, uids, and gids
would all be translated according to the receiving process's pid and
user namespaces (if an ID cannot be translated then the message is not
received). I guess that the translation would be done in the same way
as SCM_CREDENTIALS (and cgroup.procs files), that is, on the receive
side rather than the send side.
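
As a rough model of the receive-side behaviour described above, here is
a small Python sketch. Everything in it is hypothetical: ProcEvent,
map_into_ns and deliver are names invented for illustration, the
may_access callback merely stands in for a ptrace_may_access check, and
the pid translation is reduced to a dictionary lookup. The ID maps use
the same (inside, outside, length) layout as /proc/[pid]/uid_map.

```python
from typing import Callable, Dict, NamedTuple, Optional, Sequence, Tuple

# One entry per map line: (first ID inside ns, first ID outside ns, length).
IdMap = Sequence[Tuple[int, int, int]]


class ProcEvent(NamedTuple):
    """Hypothetical stand-in for a cn_proc exit event, using host-side IDs."""
    pid: int
    uid: int
    gid: int
    exit_code: int


def map_into_ns(idmap: IdMap, outside_id: int) -> Optional[int]:
    """Translate a host-side ID into the receiver's namespace, or None."""
    for inside, outside, count in idmap:
        if outside <= outside_id < outside + count:
            return inside + (outside_id - outside)
    return None


def deliver(
    event: ProcEvent,
    may_access: Callable[[ProcEvent], bool],
    pid_map: Dict[int, int],
    uid_map: IdMap,
    gid_map: IdMap,
) -> Optional[ProcEvent]:
    """Return the event as the receiver would see it, or None if dropped."""
    # Forced filter: no access rights over the process, no event.
    if not may_access(event):
        return None
    pid = pid_map.get(event.pid)
    uid = map_into_ns(uid_map, event.uid)
    gid = map_into_ns(gid_map, event.gid)
    # If any ID has no translation, the message is not received at all.
    if pid is None or uid is None or gid is None:
        return None
    return event._replace(pid=pid, uid=uid, gid=gid)
```

With a map of (0, 100000, 65536), an event for host pid 4242 running as
uid 100000 would show up inside the container as uid 0 (with whatever
pid the container sees for it), while an event for a host process
running as uid 0 would be dropped because it has no mapping.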

My reason for sending this email rather than just writing the patch is 
to see whether anyone has any solid NACKs against the use-case or 
whether there is some fundamental issue that I'm not seeing. If nobody 
objects, I'll be happy to work on this.

[1]: https://lwn.net/Articles/532748/

-- 
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
https://www.cyphar.com/

