C/R: File substitution at restart

Serge E. Hallyn serge at hallyn.com
Wed Sep 8 18:03:52 PDT 2010

Quoting Matt Helsley (matthltc at us.ibm.com):
> On Wed, Sep 08, 2010 at 08:09:31AM -0500, Serge E. Hallyn wrote:
> > Quoting Matthieu Fertré (matthieu.fertre at kerlabs.com):
> > > Hi,
> > > 
> > > Here is a proposal for a C/R related feature already developed in
> > > Kerrighed: file substitution at restart.
> > > 
> > > The goal of this mail is to start a discussion about adding such a
> > > feature to Linux cr. Comments are welcome!
> > 
> > Yup, AFAIK metacluster and zap do this too.  I don't think there is
> > any question about whether we want to support this, but rather
> > what the user-kernel API should look like.  Perhaps the easiest
> > "API" is to have the userspace program rewrite the checkpoint image,
> > but that probably isn't quite as simple as just substituting fd numbers
> > in the image, because we'll have to also find the place where the source of
> > the original fd was specified and tweak that.
> > 
> > I assume this is one of the things Oren would have 'cradvise()'
> > do, and at this point that sounds nice to me - might be worth
> > seeing how the community reacts.  Sentiments on such things change,
> > after all.
> > 
> > Have there been any other suggestions?
> I think it can be split into two composable pieces which may also be
> useful independently.
> The first uses the fcntl() interface to add a flag like
> O_CLOEXEC. Unlike O_CLOEXEC it marks an fd for preservation during
> restart. That way we don't have to specify an fd number and a "source"
> to the kernel. Just tell the kernel to keep the fd. The source can
> be opened and dup2'd via userspace. This is useful without the
> second piece if we want to simply add rather than replace an fd.

Can you think of any use for this flag other than restart?
If so, then having a fcntl flag (and later madvise) makes sense.
But if we're going to add options to various different APIs which
really are all only useful for c/r, then maybe a single new cr_advise()
really does make sense.  The alternative may be more popular at first
but would IMO turn into a disaster.
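
As a rough illustration of the userspace half of the first piece (open the
replacement source, then dup2() it onto the fd number the restarted task
expects), here is a minimal C sketch. The helper names substitute_fd() and
set_fd_flag() are made up for this example. The "keep across restart" flag
itself is hypothetical and is not defined here: the sketch only sets the
existing FD_CLOEXEC descriptor flag, on the assumption that a new flag would
go through the same fcntl(F_GETFD/F_SETFD) path.

```c
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Open `path` and install it at descriptor number `target_fd`,
 * replacing whatever was open there. Returns target_fd on success,
 * -1 on failure. (Hypothetical helper, not part of any C/R tool.)
 */
int substitute_fd(int target_fd, const char *path, int flags)
{
    assert(target_fd >= 0);

    int src = open(path, flags);
    if (src < 0)
        return -1;

    if (src == target_fd)       /* open() already landed on the right number */
        return target_fd;

    if (dup2(src, target_fd) < 0) {
        close(src);
        return -1;
    }
    close(src);                 /* only the copy at target_fd is kept */
    return target_fd;
}

/*
 * Set a per-descriptor flag without clobbering the others.
 * Today the only such flag is FD_CLOEXEC; a restart-preservation
 * flag would be a new, not yet existing, bit handled the same way.
 */
int set_fd_flag(int fd, int flag)
{
    int cur = fcntl(fd, F_GETFD);
    if (cur < 0)
        return -1;
    return fcntl(fd, F_SETFD, cur | flag);
}
```

A restart helper could call substitute_fd(1, "/some/new/log", O_WRONLY) before
handing control to the restart code, so that stdout of the restarted task
points at the new file; the path here is only a placeholder.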

> Then a separate interface/tool is needed to ignore/delete
> the extra CKPT_OBJ_FILE in the checkpoint image. That's the difficult
> part. It's difficult because depending on the open file the portions of
> the image to ignore/delete can vary wildly. For instance, imagine if an
> epoll fd was being ignored. It starts much like a generic file but there
> is an image header related to it that isn't a CKPT_OBJ_*. If we fail to
> delete/ignore this section prior to parsing then it completely breaks
> the parsing.

Yup, that is precisely what stopped me when I tried to do this 6 months
or so ago just for stdin/stdout/stderr.

> In contrast, CKPT_OBJ_* do not break the parsing since
> they aren't expected in a strict order -- the parser is capable of
> parsing them at any time and the only order constraint on them is that
> they appear in the image before they are referenced.
> This piece is also useful by itself if we want to ignore/delete an fd
> rather than substitute it.
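
The ordering rule described above (shared objects may show up anywhere in the
stream, as long as each one precedes its first reference) can be sketched as
a toy checker in C. This is not the linux-cr image format: the record layout,
the REC_OBJ/REC_REF names, and the MAX_OBJREF limit are all assumptions made
up for the example.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical record kinds: an object definition, or a use of one. */
enum rec_type { REC_OBJ, REC_REF };

struct rec {
    enum rec_type type;
    int objref;             /* id being defined or referenced */
};

#define MAX_OBJREF 64       /* arbitrary bound for the sketch */

/*
 * Walk the records in stream order. Object records are accepted at
 * any position; a reference is valid only if its object was already
 * seen. Returns 0 if the stream is consistent, -1 otherwise.
 */
int check_stream(const struct rec *recs, size_t n)
{
    unsigned char seen[MAX_OBJREF] = {0};

    assert(recs != NULL || n == 0);

    for (size_t i = 0; i < n; i++) {
        int id = recs[i].objref;

        if (id < 0 || id >= MAX_OBJREF)
            return -1;
        if (recs[i].type == REC_OBJ)
            seen[id] = 1;           /* definition: remember it */
        else if (!seen[id])
            return -1;              /* reference before definition */
    }
    return 0;
}
```

In this model, dropping an object record leaves the stream parseable as long
as nothing later references it, which matches the point that the CKPT_OBJ_*
entries are the easy part. A section that the parser expects at a fixed
position has no counterpart in this sketch.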

Are you working on any of this?
