[PATCH net-next] [RFC] netns: enable cross-ve Unix sockets

Eric W. Biederman ebiederm at xmission.com
Thu Oct 2 13:03:04 PDT 2008


I do not believe we have yet met the burden of proof necessary to make the proposed
semantic change.

"Denis V. Lunev" <den at openvz.org> writes:

> On Wed, 2008-10-01 at 18:15 +0200, Daniel Lezcano wrote:
>> 
>>   1 - the current behaviour is full isolation. Shall we/can we change 
>> that without taking into account there are perhaps some people using 
>> this today ? I don't know.
> We have a direct request from people using it to remove this state of
> isolation.

Users of a brittle but convenient hack, not application developers.

Which means there is demand for this class of communication, but it
doesn't mean that the suggested way is the right way.

>>   2 - I wish to launch a non chrooted application inside a namespace, 
>> sharing the file system without sharing the af_unix sockets, because I 
>> don't want the application running inside the container to overlap with
>> the af_unix sockets of another container. I prefer to detect a collision with
>> a strong isolation and handle it manually (remount some part of the fs 
>> for example).
> with a common filesystem you have to detect collisions at least for FIFOs.
> This situation is the same. Basically, if we treat named Unix sockets
> as an improved FIFO, it's better to use the same approach

There are two aspects to this.

1) Looking up the proxy object in the filesystem.

   With that I agree that the filesystem access rules are sufficient.
   If you don't want the possibility of proxy objects, you simply configure
   the system so that there are no shared files between namespaces.

2) How we interpret the proxy object.

   Currently the shared-filesystem (aka NFS) precedent is that you can see
   the unix domain socket but not use it.  That is just what we have
   implemented now.

As for FIFOs, you may be right that there is a potential bug there.  Likely
those fall into the yet-to-be-implemented device namespace.  It is certainly
an area of the code we have not audited and thought about, so there is no
precedent set.

>>   3 - I would like to be able to reduce this isolation (your point) to 
>> share the af_unix socket for example to use /dev/klog or something else.
>> 
>> I don't know how much we can consider the point 1, 2 pertinent, but 
>> disabling 3 lines of code via a sysctl with strong isolation as default 
>> and having a process unsharing the namespace in userspace and changing 
>> this value to less isolation is not a big challenge IMHO :)
> the real question is _who_ is responsible for this kind of stuff ->
> node (parent container) administrator or container administrator. I
> strongly vote for the first.
>
> Also if we are talking about such kind of stuff, I dislike a global
> kludge. This should be a property of two concrete VEs, and better two
> concrete sockets. Unfortunately, setsockopt is not an option :(


We support sockets from different namespaces in a single process.

I agree that to keep the hack working we can not use setsockopt.
I don't agree that we want to keep the hack working.

- The code needs an audit to think about what it means to exchange packets
  between unix domain sockets in different network namespaces.  There is a
  surprising amount of code in veth to accommodate that.

  If we have too many places where we need to do something strange it is
  going to make code maintenance difficult.

- The semantics get stranger with respect to interpreting unix domain socket
  proxy objects.   Sometimes we can connect to someplace non-local and sometimes
  we can't.

- We can do this without changing the semantics of how socket proxy objects
  are interpreted, as it is possible for a process to use sockets in two different
  network namespaces.  That requires application level changes.

- It is prohibitively difficult to implement unix domain sockets that talk
  between different kernels (file descriptors sharing offsets and garbage
  collection of in-flight sockets, ouch!).  This means that encouraging the
  use of unix domain sockets to transparently connect applications in
  different containers is probably a bad idea.  This is completely different
  from FIFOs, which are simple enough that it is not hard to relay data
  between machines and get it right.

- I don't know how much using unix domain sockets to talk transparently
  between different namespaces will confuse applications, or increase the
  risk of security exploits.
  I don't think it will be much though as we already have unix permissions checking
  from the proxy object, and we should be handling the namespace transitions of passed
  objects in the code that sends credentials and file descriptors.

Eric


More information about the Containers mailing list