Device Namespaces

Mon Oct 28 23:31:17 UTC 2013

2013/9/26 Eric W. Biederman <ebiederm at xmission.com>
>
>
> From conversations at Linux Plumbers Conference it became fairly clear
> that one of, if not the, roughest edges on containers today is dealing
> with devices.
>
> - Hotplug does not work.
> - There seems to be no implementation that does much beyond setting up
>   a static set of /dev entries today.
> - Containers do not see the appropriate uevents for their container.
>
> One of the more compelling cases I heard was of someone who was running
> a Linux desktop in a container and wanted to just let that container
> see the devices needed for his desktop, and not everything else.

I have experience implementing this functionality in the OpenVZ kernel.
I was required not to modify user-space tools, so that implementation
looks like a dirty hack, but even device hotplug works there.

....

>
> So the big issues for a device namespace to solve are filtering which
> devices a container has access to and being able to dynamically change
> which devices those are at run time (aka hotplug).
>
> After having thought about this for a bit I don't know if a pure
> userspace solution is sufficient or actually a good idea.

I would prefer to think a bit more about a userspace solution. We can
try to extend udev's functionality.

>
> - We can manually manage a tmpfs with device nodes in userspace.
>   (But that is deprecated functionality in the mainstream kernel).
> - We can manually export a subset of sysfs with bind mounts.
>   (But that feels hacky, and is essentially incompatible with hotplug).
> - We can relay a call of /sbin/hotplug from outside of a container
>   to inside of a container based on policy.
>   (But no one uses /sbin/hotplug anymore).
> - There is no way to fake netlink uevents for a container to see them.
>   (The best we could do is replace udev everywhere with something that
>    listens on a unix domain socket).

Or we can teach udev to listen on a unix domain socket.

The host udev listens on netlink. When it gets an event about a new
device, it decides which containers the device must be available to,
performs all required actions, and sends the event into those
containers. The notification protocol would probably have to be unified
across all udev-like services.
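
A rough sketch of the relay idea above, in Python for brevity. The
policy table, container names, and function names are all made up for
illustration. The only thing taken as given is the kernel uevent payload
layout: an "action@devpath" summary followed by NUL-separated KEY=VALUE
pairs. A real host daemon would read these payloads from a
NETLINK_KOBJECT_UEVENT socket rather than from a byte string.

```python
import fnmatch
import socket

# Hypothetical policy: container name -> DEVPATH glob patterns it may see.
POLICY = {
    "desktop-ct": ["/devices/*/usb*", "/devices/*/input/*"],
    "build-ct": [],
}


def parse_uevent(raw: bytes) -> dict:
    """Parse a kernel uevent payload into a dict of its KEY=VALUE pairs."""
    fields = [f for f in raw.split(b"\0") if f]
    event = {}
    for field in fields[1:]:  # fields[0] is the "action@devpath" summary
        key, sep, value = field.decode().partition("=")
        if sep:
            event[key] = value
    return event


def containers_for(event: dict, policy: dict) -> list:
    """Return the containers whose patterns match the event's DEVPATH."""
    devpath = event.get("DEVPATH", "")
    return sorted(
        name
        for name, patterns in policy.items()
        if any(fnmatch.fnmatchcase(devpath, p) for p in patterns)
    )


def relay(raw: bytes, policy: dict, sockets: dict) -> list:
    """Forward the raw event to each matching container's unix socket."""
    targets = containers_for(parse_uevent(raw), policy)
    for name in targets:
        sockets[name].send(raw)
    return targets
```

Here `sockets` maps a container name to the host end of a unix datagram
socket whose other end is held by the udev-like service inside that
container. For experiments, socket.socketpair(socket.AF_UNIX,
socket.SOCK_DGRAM) gives such a pair without any privileges.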

>
> - It would be nice to replace the device cgroup with a comprehensive
>   solution that really works. (Among other things the device cgroup
>   does not work in terms of struct device, the underlying kernel
>   abstraction for devices).
>
> We must manage sysfs entries as well as device nodes because:
> - Seeing more than we should has the real potential to confuse
>   userspace, especially a userspace that replays uevents.
> - Some device control must happen through writing to sysfs files and
>   if we don't remove all root privileges from a container only by
>   exporting a subset of sysfs to that container can we limit which
>   sysfs nodes can be written to.

Sorry if the following idea sounds crazy. Can we use FUSE filesystems
to filter sysfs and devtmpfs? When a CT mounts sysfs, it would actually
mount a fuse-sysfs, which is implemented by a userspace program on the
host system.

* This approach makes it possible to emulate the behavior of uevent
files in containers, if we use unix sockets between udev services.
* A userspace daemon will probably be more flexible and customizable
than something in the kernel.
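
To make the filtering part of the fuse-sysfs idea a bit more concrete,
here is a minimal sketch of the visibility check such a daemon might
apply. It is not tied to any FUSE binding: SysfsFilter, its methods, and
demo_tree are hypothetical names, and the allow list stands in for
whatever per-container policy the host would really keep. A path is
shown if it lies inside an allowed subtree or is a parent directory
leading to one.

```python
import os
import posixpath
import tempfile


class SysfsFilter:
    """Decide which entries of a sysfs-like tree a container may see."""

    def __init__(self, root, allowed):
        self.root = root  # real tree on the host, e.g. /sys
        self.allowed = [self._norm(p) for p in allowed]

    @staticmethod
    def _norm(path):
        return posixpath.normpath("/" + path.strip("/"))

    def visible(self, path):
        path = self._norm(path)
        prefix = path.rstrip("/") + "/"
        for a in self.allowed:
            # Inside an allowed subtree, or an ancestor on the way to one.
            if path == a or path.startswith(a + "/") or a.startswith(prefix):
                return True
        return False

    def readdir(self, path):
        """List only the visible entries of a directory in the tree."""
        rel = path.strip("/")
        names = os.listdir(os.path.join(self.root, rel))
        return sorted(
            n for n in names if self.visible(posixpath.join("/", rel, n))
        )


def demo_tree():
    """Build a throwaway directory tree that stands in for /sys."""
    root = tempfile.mkdtemp()
    for d in ("class/net/eth0", "class/block/sda", "bus/usb"):
        os.makedirs(os.path.join(root, d))
    return root
```

A FUSE daemon would call something like readdir() and visible() from
its directory-listing and lookup handlers, and apply the same check
before passing a write through to the real sysfs file.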

Do we have a use case where sysfs performance is critical?

Thanks,
Andrey