C/R: File substitution at restart

Louis Rilling Louis.Rilling at kerlabs.com
Thu Sep 9 03:37:20 PDT 2010

On 08/09/10 21:06 -0700, Matt Helsley wrote:
> On Wed, Sep 08, 2010 at 08:03:52PM -0500, Serge E. Hallyn wrote:
> > Quoting Matt Helsley (matthltc at us.ibm.com):
> > > On Wed, Sep 08, 2010 at 08:09:31AM -0500, Serge E. Hallyn wrote:
> > > I think it can be split into two composable pieces which may also be
> > > useful independently.
> > > 
> > > The first uses the fcntl() interface to add a flag like
> > > O_CLOEXEC. Unlike O_CLOEXEC it marks an fd for preservation during
> > > restart. That way we don't have to specify an fd number and a "source"
> > > to the kernel. Just tell the kernel to keep the fd. The source can
> > > be opened and dup2'd via userspace. This is useful without the
> > > second piece if we want to simply add rather than replace an fd.
> > 
> > Can you think of any other use for this flag other than restart?
> <joking>
> I can't think of any other uses for O_CLOEXEC.
> </joking>
> Seriously though, restart will be used _much_ less often than exec so yes
> it does seem like a waste of a valuable bit and something that wouldn't
> quite belong in an fcntl interface.
> However we can try to be a tad clever -- we could (ab|re)use O_CLOEXEC.
> Right now restart closes all file descriptors and pays absolutely
> no attention to O_CLOEXEC. We could reuse O_CLOEXEC to mean O_CLOREST
> too. Have user-cr's restart tool mark all unwanted fds O_CLOEXEC. Any we
> want to keep we do not mark with O_CLOEXEC.

This would also be useful at checkpoint, to tell sys_checkpoint() which fds
should be ignored, either because they are not supported or because the
application has a better way to deal with them.
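
To make the marking step concrete, here is a rough userspace sketch of what
the restart tool (or an application before checkpoint) could do. It only uses
the existing fcntl() F_GETFD/F_SETFD calls; the helper names and the keep list
are made up for illustration, and the kernel side that would honor the flag at
restart or checkpoint is the proposal under discussion, not existing behavior.

```c
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper: returns 1 if fd appears in the keep list. */
static int fd_is_kept(int fd, const int *keep, int nkeep)
{
    for (int i = 0; i < nkeep; i++)
        if (keep[i] == fd)
            return 1;
    return 0;
}

/*
 * Hypothetical helper: for every open fd in [0, max_fd), set FD_CLOEXEC
 * unless the fd is in keep[], in which case the flag is cleared.
 * Returns the number of fds marked for closing, or -1 on error.
 */
static int mark_unwanted_fds(const int *keep, int nkeep, int max_fd)
{
    int marked = 0;

    for (int fd = 0; fd < max_fd; fd++) {
        int flags = fcntl(fd, F_GETFD);

        if (flags < 0)
            continue;               /* fd is not open, nothing to mark */

        if (fd_is_kept(fd, keep, nkeep)) {
            flags &= ~FD_CLOEXEC;   /* preserve this fd */
        } else {
            flags |= FD_CLOEXEC;    /* assumed to also mean "close on restart" */
            marked++;
        }

        if (fcntl(fd, F_SETFD, flags) < 0)
            return -1;
    }
    return marked;
}
```

The fds to keep would be opened and dup2()'d into place by userspace before
calling this, so no fd number or "source" has to be passed to the kernel.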

> Here's another idea which I haven't fully thought out yet.
> We could introduce the concept of object id substitutions in the image.
> So the image would look like (going from file pos 0 at the top..):
> 0 +-------------------------------+
>   |                               |
>                 .....
>   +-------------------------------+
>   |     <substitute object>       | <--- object with id == <substitute id>
>                 .....
>   +---------------+---------------+
>   |  <object id>  |<substitute id>|
>   +---------------+---------------+
>                 .....
>   +---------------+---------------+
>   |     <object to ignore>        | <-- object with id == <object id>
>                 .....
> (The above is ignoring the ckpt_hdr fields..)
> When we read the image during restart we use the substitute ids to
> create indirect objhash entries. When we encounter an obj id and
> it refers to an indirect entry we first parse the object (ignoring
> errors and dropping references on new objhash insertions), flip
> a bit on the indirect entry (indicating the object has been parsed),
> and then lookup the substitute id and return whatever that resolved to.
> We can ignore the new objhash objects by making the objhash have its
> own operation struct. When we're parsing an object that's been
> substituted we just temporarily set the objhash add/lookup operations
> to something suitable for properly dropping references to the new
> object(s). This way we don't have to add checks for this peculiar
> need all over the checkpoint/restart code. Sure it'll be slower...

If at checkpoint we take care to ignore files that we know will be
substituted, this should not be much slower.
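
For reference, the indirect-entry lookup described in the quoted text could be
sketched like this. This is a toy table, not the real objhash code: the struct
layout, the function names and the fixed-size array are all assumptions, and
the parsing and reference dropping for the ignored object are left out.

```c
#include <stddef.h>

#define OBJ_MAX 16

/* Hypothetical entry: direct entries carry ptr, indirect ones a substitute id. */
struct obj_entry {
    int id;          /* object id as found in the image */
    int substitute;  /* id to resolve instead, or -1 for a direct entry */
    void *ptr;       /* restored object (unused for indirect entries) */
};

struct objhash {
    struct obj_entry e[OBJ_MAX];
    int n;
};

static struct obj_entry *obj_find(struct objhash *h, int id)
{
    for (int i = 0; i < h->n; i++)
        if (h->e[i].id == id)
            return &h->e[i];
    return NULL;
}

/* Add a direct (substitute == -1) or indirect entry; returns 0 or -1. */
static int obj_add(struct objhash *h, int id, int substitute, void *ptr)
{
    if (h->n >= OBJ_MAX || obj_find(h, id))
        return -1;
    h->e[h->n].id = id;
    h->e[h->n].substitute = substitute;
    h->e[h->n].ptr = ptr;
    h->n++;
    return 0;
}

/*
 * Resolve id to an object, following substitute ids. The hop count is
 * bounded by the table size so that a cycle of substitutions returns NULL
 * instead of looping forever.
 */
static void *obj_lookup(struct objhash *h, int id)
{
    for (int hops = 0; hops <= h->n; hops++) {
        struct obj_entry *e = obj_find(h, id);

        if (!e)
            return NULL;
        if (e->substitute < 0)
            return e->ptr;
        id = e->substitute;
    }
    return NULL;
}
```

With an entry for id 20 pointing at substitute id 10, obj_lookup(h, 20)
returns whatever was registered for 10, which is the behavior the restart
code would see for a substituted object.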

> I can think of a few problems with that already. If the substituted
> obj differs wildly in file type then any defer queue entries that use
> obj ids to complete the deferred work would fail miserably...

The problem I see with rewriting the image is that it may impose additional
I/O, for instance to duplicate the image before rewriting it, or if it is
simply rewritten on disk. In contrast, an easily parsable table of fds at the
beginning of the image, with associated object ids (and preferably more info
like file type, path, owner rights, etc.), makes it easy and lightweight to
build a separate substitution table that we could feed to sys_restart()
(maybe only the coordinator could feed sys_restart() with such a table).
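
As a rough illustration of what I mean, userspace could build the substitution
table from the fd table with something like the code below. None of these
structures exist in the current image format or in the sys_restart()
interface; the names and fields are hypothetical, and matching on the path is
just one possible selection rule.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical row of the fd table stored at the beginning of the image. */
struct fd_desc {
    int fd;             /* fd number in the checkpointed task */
    int objref;         /* object id of the file in the image */
    const char *path;   /* path recorded at checkpoint */
};

/* Hypothetical user rule: replace files at 'path' with the open fd 'new_fd'. */
struct subst_rule {
    const char *path;
    int new_fd;
};

/* Hypothetical row of the table that would be passed along with restart. */
struct fd_subst {
    int objref;         /* object to skip in the image */
    int new_fd;         /* already-open fd to use instead */
};

/*
 * Build at most out_cap substitution rows from the image fd table.
 * Several fds may share one file object (dup), so each objref is
 * emitted only once. Returns the number of rows written.
 */
static size_t build_subst_table(const struct fd_desc *tbl, size_t ntbl,
                                const struct subst_rule *rules, size_t nrules,
                                struct fd_subst *out, size_t out_cap)
{
    size_t n = 0;

    for (size_t i = 0; i < ntbl; i++) {
        int seen = 0;

        for (size_t k = 0; k < n; k++)
            if (out[k].objref == tbl[i].objref)
                seen = 1;
        if (seen)
            continue;

        for (size_t j = 0; j < nrules && n < out_cap; j++) {
            if (strcmp(tbl[i].path, rules[j].path) == 0) {
                out[n].objref = tbl[i].objref;
                out[n].new_fd = rules[j].new_fd;
                n++;
                break;
            }
        }
    }
    return n;
}
```

Only the small fd table has to be read to do this; the rest of the image is
left untouched, which is the point compared to rewriting it.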

> That said, so far I've never heard folks discuss substituting anything
> but fds. Perhaps enabling substitution at the objhash level is just too
> broad and we'd be better off only allowing fd substitutions?

Well... A while ago I asked about substituting SYSV IPC objects (I was talking
about SHM at the time, but semaphore sets or message queues are even easier
to substitute). In such a scenario, a video transcoding pipeline would use
SYSV SHMs to store intermediate frames between the stages of the pipeline, and
the SHMs themselves would not need to be checkpointed, or could be checkpointed
at a lower frequency than the processes.

Substituting memory mapped files (for instance POSIX SHMs) would be useful too.

> > If so, then having a fcntl flag (and later madvise) makes sense.
> > But if we're going to add options to various different APIs which
> > really are all only useful for c/r, then maybe a single new cr_advise()
> > really does make sense.  The alternative may be more popular at first
> > but would IMO turn into a disaster.
> Good point.

cr_advise() or changing sys_checkpoint() and sys_restart() are both fine with me.



Dr Louis Rilling			Kerlabs
Skype: louis.rilling			Batiment Germanium
Phone: (+33|0) 6 80 89 08 23		80 avenue des Buttes de Coesmes
http://www.kerlabs.com/			35700 Rennes