[BIG RFC] Filesystem-based checkpoint

Dave Hansen dave at linux.vnet.ibm.com
Mon Nov 3 09:23:19 PST 2008


On Fri, 2008-10-31 at 13:51 -0700, Eric W. Biederman wrote:
> Dave Hansen <dave at linux.vnet.ibm.com> writes:
> > Eric, you were saying that my interface had way too many "dangerous
> > syscalls".  How does this relate to user namespaces and creating objects
> > with particular ids?  Surely if the true problem with my suggested
> > approach has to do with creating empty namespaces, the same problem
> > exists with the sys_checkpoint() approach.
...
> In a sys_restore() scenario at the very start you can check to make
> certain that the reference count for the namespaces is 1 and that they
> are empty.  Which means there is no chance of confusing user space.
> 
> With fork_and_set_child_pid() what is a simple cheap one time check
> becomes an expensive painful one, if you can even implement it at all.
> 
> The difference is that with a bunch of small pieces you lose atomicity.

I think we're just trading trade-offs here. :)

I believe your suggestion is simply to constrain the problem.  If we put
extra restrictions on sys_restart() to ensure that its job is simpler
then some of the implementation problems just go away.  That's
definitely a good approach.

In this case you are saying that, during a call to sys_restart(), we
should ensure that the task doing the restoring holds the only reference
to those namespaces.  If it does, that means that there can't possibly
be any security implications because no one else can possibly even *see*
those namespaces.  This is a laudable goal, but I'm not sure it works in
practice without more code.

The problem is that we can't rely on refcounts alone (at least not the
ones we have today).  For instance, with the pid namespace, we would
have 1 ref for the 'init' process doing the sys_restart() call and then
a possible second ref for /proc.  Perhaps we could differentiate
references that instantiate objects inside a namespace from references
purely to the namespace *itself*.
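
As a rough illustration of that distinction, here is a user-space
sketch.  Everything in it is hypothetical: struct ns_model, its two
counters, and the helper names are made up for this example and do not
correspond to existing kernel structures or APIs.  It only shows why a
single combined count rejects a namespace that is still empty, while a
separate object count would not.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of a namespace with two kinds of references. */
struct ns_model {
	int self_refs;   /* refs to the namespace itself (restoring task, /proc) */
	int object_refs; /* refs held by objects instantiated inside it */
};

static void ns_get_self(struct ns_model *ns)   { ns->self_refs++; }
static void ns_get_object(struct ns_model *ns) { ns->object_refs++; }

static void ns_put_object(struct ns_model *ns)
{
	assert(ns->object_refs > 0);
	ns->object_refs--;
}

/* What a single combined refcount can tell us: is there exactly one ref? */
static bool ns_refcount_is_one(const struct ns_model *ns)
{
	return ns->self_refs + ns->object_refs == 1;
}

/* What the split count can tell us: has anything been created inside? */
static bool ns_is_fresh(const struct ns_model *ns)
{
	return ns->object_refs == 0;
}
```

With one reference from the restoring task and a second from /proc,
ns_refcount_is_one() fails even though nothing lives in the namespace
yet, whereas ns_is_fresh() still passes until the first object is
created.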

Rather than offering a solution for the filesystem-based approach, I'll
venture this: whatever I come up with will be extra code to glue things
back together, to detect when namespaces are "fresh" and able to be
scribbled into.  

Anyway, it's obvious that you and Oren dislike my approach just as
much as I dislike the syscalls.  So, I'll just drop it for now.  But,
please do keep it in the back of your minds in case it applies
somewhere.

-- Dave


