C/R minisummit notes

sukadev at us.ibm.com
Wed Jul 23 18:41:22 PDT 2008


Oren Laadan [orenl at cs.columbia.edu] wrote:
| 
| 
| Serge E. Hallyn wrote:
| > Quoting Daniel Lezcano (dlezcano at fr.ibm.com):
| >>   * What problems can the Linux community solve with 
| >> checkpoint/restart ?
| >>
| >> 	Eric Biederman reminds us that at the previous OLS nobody complained 
| >> about checkpoint/restart
| >>
| >> 	Pavel Emelyanov : The startup of Oracle takes some minutes; if we 
| >> checkpoint just after startup, Oracle can later be restarted from that 
| >> point, which gives a fast startup
| >>
| >> 	Oren Laadan : Time travel, we can take monotonic snapshots and go back 
| >> to one of these snapshots.
| >>
| >> 	Eric Biederman : Priority running, checkpoint/kill an application and 
| >> run another application with a higher priority
| >>
| >> 	Denis Lunev : Task migration, move an application from one host to another host
| >>
| >> 	Daniel Lezcano : SSI (task migration)
| >>
| >>   * Preparing the kernel internals
| >>
| >> 	OL : Can we implement a kernel module and move CR functionality into 
| >> the kernel itself later ?
| >>
| >> 	EB : Better to add a little CR functionality into the kernel itself 
| >> and add more afterwards.
| >>
| >> 	DLu : Problem with kernel version
| >>
| >> 	OL : Compatibility with intermediate kernel version should be possible 
| >> with userspace conversion tools
| >>
| >> 	DLu : A non-sequential file for the checkpoint statefile is a challenge
| >>
| >> 	OL : yes, but possible and useful for compression/encryption
| >>
| >> 	We showed that there are five steps to realize a checkpoint:
| >>
| >> 	1 - Pre-dump
| > 
| > I'd just add here that the pre-dump is where you might start writing
| > memory to disk, trying to get disk and memory closer and closer to
| > being the same until, at some point, you decide they are close enough
| > that you can go on to step two, and attempt the freeze+dump+migrate/kill
| > with minimal downtime.
| > 
| > Coming into the discussion my primary concern had been that doing a
| > sys_checkpoint() system call would be tough to augment to provide this
| > kind of incremental checkpoint, but this breakdown is great for that.
| > 
| >> 	2 - Freeze
| >> 	3 - Dump
| >> 	4 - Resume/kill
| >> 	5 - Post-dump
| >>
| >> 	At this point we stated that we want to create a proof of concept and 
| >> checkpoint/restart the simplest application.
| > 
| > By which we mean, start with a piece of step 3 (and maybe a bit of
| > step 4).
| 
| step 4 is also part of the freezer -- it's the unfreeze operation
| (or force a SIGKILL to all processes in the container).

Are steps 1-5 considered part of the sys_checkpoint() system call, so that
a successful sys_checkpoint() returns only after step 5 ?

If so, as Serge points out, wouldn't it be harder to optimize for
incremental checkpoints (as each sys_checkpoint() would be independent) ?

But that may not be something to worry about for the POC.
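
To make the question concrete, here is a minimal userspace sketch of the
five steps as separate phases. It is only a toy simulation: memory and disk
are dicts, the dirty set is a Python set, and every name in it (pre_dump,
checkpoint, shrinking_writer, the threshold parameter) is hypothetical and
not a proposed kernel interface. It assumes the pre-dump loop described by
Serge, which keeps copying dirty pages while the task runs and stops once
few enough pages remain dirty.

```python
# Hypothetical simulation of the five steps; none of these names are a real
# kernel or library API.

def pre_dump(memory, disk, dirty, threshold, run_task, max_rounds=10):
    """Step 1: copy dirty pages to disk while the task keeps running."""
    rounds = 0
    while len(dirty) > threshold and rounds < max_rounds:
        for page in dirty:
            disk[page] = memory[page]
        dirty.clear()
        run_task(memory, dirty, rounds)  # the task runs and re-dirties pages
        rounds += 1
    return rounds


def checkpoint(memory, disk, dirty, threshold, run_task):
    """Run steps 1-5 and return a log of (step, count) pairs."""
    log = [("pre-dump", pre_dump(memory, disk, dirty, threshold, run_task))]
    # Step 2: freeze. The task no longer runs, so `dirty` stops growing.
    log.append(("freeze", len(dirty)))
    # Step 3: dump only the pages that are still dirty.
    dumped = len(dirty)
    for page in dirty:
        disk[page] = memory[page]
    dirty.clear()
    log.append(("dump", dumped))
    # Step 4: resume (or kill) the task. Step 5: post-dump cleanup.
    log.append(("resume", 0))
    log.append(("post-dump", 0))
    return log


def shrinking_writer(memory, dirty, round_no):
    """Toy workload that writes to 4, then 2, then 1 pages per round."""
    for page in range(4 >> round_no):
        memory[page] += 1
        dirty.add(page)


memory = {page: 0 for page in range(8)}
disk = {}
dirty = set(memory)  # nothing has been written to disk yet
log = checkpoint(memory, disk, dirty, threshold=2, run_task=shrinking_writer)
print(log)
```

In this toy run only the pages still dirty at freeze time are written while
the task is stopped. If steps 1-5 were folded into one independent call with
no state carried over, every call would start with all pages dirty, which is
the incremental-checkpoint concern raised above.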

