[PATCH 1/1] namespaces: introduce sys_hijack (v11)

Tue Aug 12 10:06:58 PDT 2008

Quoting Serge E. Hallyn (serue at us.ibm.com):
> Quoting Bastian Blank (bastian at waldi.eu.org):
> > On Fri, Aug 01, 2008 at 11:39:05AM -0500, Serge E. Hallyn wrote:
> > > Quoting Bastian Blank (bastian at waldi.eu.org):
> > > > Why is it not enough to use the pid of the ns creator? The ns cgroups
> > > 
> > > pids wrap around
> > 
> > Ups, yes.
> > 
> > > > But I think I have a different problem. Currently, namespaces are
> > > > destructed if the last process using them exits. You change that, they
> > > > will survive until the cgroup dies. Or is that cgroup destructed when
> > > > there are no longer processes using the nsproxy? As the commit message
> > > > speaks about "pid wraparound" as problem, I doubt that.
> > > 
> > > Correct.  Having the namespaces stick around, and being able to attach
> > > to an empty container, was something Paul Menage had wanted IIRC.
> > 
> > It may produce problems with pid namespaces. The namespace is cleared if
> > the child reaper dies and I'm not sure how well it behaves without a new
> > one, which you can't create.
> > 
> > > But I'll leave that as is for now, until I hear something other than
> > > "this is so wrong it isn't funny" from Pavel :)
> > 
> > I'm not sure if it is funny to add another piece which may hold
> > filesystems open. Currently we can have different namespaces. All of
> > them are attached to processes and can be removed with kill. Now this
> > code adds another copy to an (automatically created) cgroup.
> > 
> > IMHO, the cgroup should be destructed automatically if the nsproxy is
> > about to die.
> 
> I certainly don't think your caution is unwarranted.  I like to keep the
> refcounting in all of this as simple as possible.

And as always those calling for caution are vindicated.  It turns out I
was grabbing a double-refcount on the nsproxy when a ns_cgroup is cloned.
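
To illustrate the kind of bug meant here, a minimal user-space sketch
follows.  It is not the patch code: the toy_* names and the extra_get
flag are hypothetical, and a plain int stands in for the real nsproxy
refcount.  It only shows that one extra get on the clone path means the
final put never reaches zero, so the object is never freed.

```c
#include <stdlib.h>

/* Hypothetical stand-in for struct nsproxy; only the refcount matters. */
struct toy_nsproxy {
	int count;
};

static void toy_get(struct toy_nsproxy *ns)
{
	ns->count++;
}

/* Returns 1 if this put dropped the last reference and freed the object. */
static int toy_put(struct toy_nsproxy *ns)
{
	if (--ns->count == 0) {
		free(ns);
		return 1;
	}
	return 0;
}

/* Hypothetical clone path; extra_get models the accidental second get. */
static void toy_cgroup_clone(struct toy_nsproxy *ns, int extra_get)
{
	toy_get(ns);		/* the reference the cgroup should hold */
	if (extra_get)
		toy_get(ns);	/* the bug: a second, unbalanced reference */
}

/*
 * Walks one object through its life: the task holds one reference, the
 * cgroup is cloned, then the task and the cgroup each put once.
 * Returns 1 if the object was freed by the final put, 0 if it leaked.
 */
static int toy_lifecycle(int extra_get)
{
	struct toy_nsproxy *ns = malloc(sizeof(*ns));
	int freed;

	if (!ns)
		abort();
	ns->count = 1;			/* reference held by the task */
	toy_cgroup_clone(ns, extra_get);
	toy_put(ns);			/* task exits */
	freed = toy_put(ns);		/* cgroup is removed */
	if (!freed)
		free(ns);		/* demo cleanup only; the real object would leak */
	return freed;
}
```

With toy_lifecycle(0) the last put frees the object; with
toy_lifecycle(1) the count is still 1 after both puts.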

After fixing that, I get warnings about potential circular locking
involving cgroup_mutex and namespace_sem.  This is because cgroup_mutex
depends on namespace_sem, but now doing rmdir on a once-filled ns_cgroup
calls put_fs_struct(ns_cgroup->fs).

But again, this patch was resent to solicit comment on the general
approach.  So I will put this patch aside again, unless I hear:

1. From Pavel, that he actually would like to use this approach for
namespace entering.

2. From Paul, that he still has a need for entering empty cgroups.

Otherwise, there is still the point of view (held, I believe, by Eric)
that the right thing to do is provide the monitoring and control over
containers that we need through proper namespace semantics and exported
filesystems.

thanks,
-serge