[RFC][PATCH] Improve NFS use of network and mount namespaces

Tue May 12 16:46:50 PDT 2009

On Tue, 2009-05-12 at 14:51 -0700, Matt Helsley wrote:
> Sun RPC currently opens sockets from the initial network namespace making it
> impossible to restrict which NFS servers a container may interact with.
> 
> For example, the NFS server at 10.0.0.3 reachable from the initial namespace
> will always be used even if an entirely different server with the address
> 10.0.0.3 is reachable from a container's network namespace. Hence network
> namespaces cannot be used to restrict the network access of a container as long
> as the RPC code opens sockets using the initial network namespace. This is
> in stark contrast to other protocols like HTTP where the sockets are created in
> their proper namespaces because kernel threads are not used to open sockets for
> client network IO.
> 
> We may plausibly end up with namespaces created in either of two ways:
> I) The administrator may mount 10.0.0.3:/export_foo from init's
> container, clone the mount namespace, and unmount from the original
> mount namespace.
> 
> II) The administrator may start a task which clones the mount namespace
> before mounting 10.0.0.3:/export_foo.
> 
> Proposed Solution:
> 
> The network namespace of the task that did the mount best defines which server
> the "administrator", whether in a container or not, expects to work with.
> When the mount is done inside a container then that is the network namespace 
> to use. When the mount is done prior to creating the container then that's the 
> namespace that should be used.
> 
> This allows system administrators to isolate network traffic generated by NFS
> clients by mounting after creating a container. If partial isolation is desired
> then the administrator may mount before creating a container with a new network
> namespace. In each case the RPC packets would originate from a consistent
> namespace.
> 
> One way to ensure consistent namespace usage would be to hold a reference to
> the original network namespace as long as the mount exists. This naturally 
> suggests storing the network namespace reference in the NFS superblock. 
> However, it may be better to store it with the RPC transport itself since
> it is directly responsible for (re)opening the sockets.
> 
> This patch adds a reference to the network namespace to the RPC
> transport. When the NFS export is mounted the network namespace of
> the current task establishes which namespace to reference. That
> reference is stored in the RPC transport and used to open sockets
> whenever a new socket is required.
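
For illustration only, the scheme described in the quoted text can be modelled
in a few lines of userspace C. This is not code from the patch: toy_net,
toy_xprt and every toy_* function are hypothetical stand-ins for the kernel's
network namespace and RPC transport, and the "socket" is reduced to the id of
the namespace it was opened in. The sketch assumes only what the description
says: the reference is taken at mount time, held for the life of the transport,
and used for every (re)connect.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a network namespace: an id plus a refcount. */
struct toy_net {
    int id;
    int refcount;
};

/* Hypothetical stand-in for an RPC transport that remembers its namespace. */
struct toy_xprt {
    struct toy_net *net;  /* namespace captured when the mount was done */
    int sock_net_id;      /* namespace of the open "socket", -1 if none */
};

static struct toy_net *toy_get_net(struct toy_net *net)
{
    net->refcount++;
    return net;
}

static void toy_put_net(struct toy_net *net)
{
    assert(net->refcount > 0);
    net->refcount--;
}

/* Mount time: pin the mounting task's namespace in the transport. */
static void toy_xprt_create(struct toy_xprt *xprt, struct toy_net *mounter_net)
{
    xprt->net = toy_get_net(mounter_net);
    xprt->sock_net_id = -1;
}

/*
 * (Re)connect: the socket is opened in the stored namespace, no matter
 * which namespace the calling context (e.g. a kernel thread) is in.
 */
static void toy_xprt_connect(struct toy_xprt *xprt,
                             const struct toy_net *caller_net)
{
    (void)caller_net;  /* deliberately ignored */
    xprt->sock_net_id = xprt->net->id;
}

/* Unmount: drop the reference taken at mount time. */
static void toy_xprt_destroy(struct toy_xprt *xprt)
{
    toy_put_net(xprt->net);
    xprt->net = NULL;
    xprt->sock_net_id = -1;
}
```

In this model a transport created from a container's namespace keeps that
namespace alive until toy_xprt_destroy() runs, and a reconnect issued from the
initial namespace still opens its socket in the container's namespace.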

Ewwwwwwww

You ignore the fact that NFS super blocks that point to the same
filesystem are shared (including between containers). We don't want to
have separate page caches in cases where the filesystems are the same;
that causes unnecessary cache consistency problems. There is sharing at
other levels too. All NFSv4 super blocks that share a server IP address
will also share a common lease. Ditto for NFSv2 and NFSv3 clients and
their lock monitoring state.
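
To make the sharing problem concrete, here is a toy userspace model, not the
real NFS superblock lookup. toy_sb, toy_sget() and the use_net flag are all
hypothetical names invented for this sketch. It assumes superblocks are found
by server address and export path; with that key, two mounts of 10.0.0.3 made
from different network namespaces land on the same superblock even if they are
different machines. Adding the namespace to the key separates them, at the
price of the duplicated caches described above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

#define TOY_MAX_SB 8

/* Hypothetical stand-in for an NFS superblock. */
struct toy_sb {
    char server[16];  /* dotted-quad server address */
    char path[32];    /* export path */
    int net_id;       /* namespace the first mount came from */
    int users;        /* number of mounts sharing this entry */
};

static struct toy_sb toy_sbs[TOY_MAX_SB];
static size_t toy_nr_sbs;

/*
 * Find a matching superblock or create one. When use_net is 0 the
 * namespace is not part of the key, so mounts from different
 * namespaces that name the same address and path are shared.
 */
static struct toy_sb *toy_sget(const char *server, const char *path,
                               int net_id, int use_net)
{
    struct toy_sb *sb;
    size_t i;

    assert(server != NULL && path != NULL);

    for (i = 0; i < toy_nr_sbs; i++) {
        sb = &toy_sbs[i];
        if (strcmp(sb->server, server) == 0 &&
            strcmp(sb->path, path) == 0 &&
            (!use_net || sb->net_id == net_id)) {
            sb->users++;
            return sb;
        }
    }

    if (toy_nr_sbs == TOY_MAX_SB)
        return NULL;  /* table full */

    sb = &toy_sbs[toy_nr_sbs++];
    snprintf(sb->server, sizeof(sb->server), "%s", server);
    snprintf(sb->path, sizeof(sb->path), "%s", path);
    sb->net_id = net_id;
    sb->users = 1;
    return sb;
}
```

Calling toy_sget("10.0.0.3", "/export_foo", 0, 0) and then
toy_sget("10.0.0.3", "/export_foo", 1, 0) returns the same entry both times;
passing use_net = 1 on the second call creates a separate one.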

You ignore the fact that NFS often depends on a whole slew of other RPC
services: kernel services like NLM (a.k.a. lockd) and the portmap/rpcbind
client, and user space utilities like statd and the portmap/rpcbind
server. Are we supposed to add socket namespace crap to all those APIs
too?

What happens to services like rpc.gssd, of which there is only one user
space instance, and which use the IP address of the server (as supplied
by the kernel) to figure out who they are talking to?

Finally, what happens if someone decides to set up a private socket
namespace, using CLONE_NEWNET, without also using CLONE_NEWNS to create
a private mount namespace? Would anyone have even the remotest chance in
hell of figuring out what filesystem is mounted where in the ensuing
chaos?

Trond