[PATCH 0/6] /proc/pid/checkpointable
Serge E. Hallyn
serue at us.ibm.com
Thu Mar 26 06:29:12 PDT 2009
Quoting Cedric Le Goater (legoater at free.fr):
> Serge E. Hallyn wrote:
> > Quoting Eric W. Biederman (ebiederm at xmission.com):
> >> Dave Hansen <dave at linux.vnet.ibm.com> writes:
> >>> On Wed, 2009-03-18 at 13:03 -0700, Mike Waychison wrote:
> >>>> Polluting the dmesg buffer with messages from common failures (consider
> >>>> a multi-user cluster where checkpoints may or may not succeed) isn't
> >>>> very useful.
> >>> Yeah, I've already gotten an earful from Serge and Dan S. about this. :)
> >>> Serge suggested that, perhaps, the audit framework could be used. We
> >>> might also use an ftrace buffer if we want to keep a whole ton of
> >>> messages around, too.
> >>> dmesg is definitely not workable long-term at all.
> >> How about having placeholder objects in the generated checkpoint?
> >> Then instead of having a failure you have a non-restorable checkpoint.
> >> But you know which fd, or which mmaped region, or which other thing
> >> is causing the problem and if you want more information you can
> >> look at that resource.
> >> That gives user space the freedom to scrub out the non-checkpointable
> >> bits and replace them with something like /dev/null so that we can
> >> continue on and restore the checkpoint anyway, if we think our
> >> app can cope with some things going away.
> >> Eric
> > I like this idea.
> Yes. This is something required to replace stdios, for example, when
> you execute an application under ssh, checkpoint and then restart on
> another host. This is a topical scenario for a batch manager in an HPC
> identified resources of the container are tracked to be ignored by
> checkpoint and to be replaced by similar ones at restart.
So in that case how are the resources identified? Does the user
specify them at checkpoint? Do you look for specific strings
(/dev/pts/*) at restart?