C/R minisummit notes
orenl at cs.columbia.edu
Wed Jul 23 14:38:20 PDT 2008
Serge E. Hallyn wrote:
> Quoting Daniel Lezcano (dlezcano at fr.ibm.com):
>> * What are the problems that the linux community can solve with the
>> checkpoint/restart ?
>> Eric Biederman reminds us that at the previous OLS nobody complained about the
>> Pavel Emelyanov : The startup of Oracle takes some minutes; if we
>> checkpoint just after the startup, Oracle can be restarted from this
>> point later and provide fast startup
>> Oren Laadan : Time travel, we can take monotonic snapshots and go back to
>> one of these snapshots.
>> Eric Biederman : Priority running, checkpoint/kill an application and
>> run another application with a higher priority
>> Denis Lunev : Task migration, move an application from one host to another host
>> Daniel Lezcano : SSI (task migration)
>> * Preparing the kernel internals
>> OL : Can we implement a kernel module and move CR functionality into
>> the kernel itself later ?
>> EB : Better to add a little CR functionality into the kernel itself
>> and add more after.
>> DLu : Problem with kernel version
>> OL : Compatibility with intermediate kernel version should be possible
>> with userspace conversion tools
>> DLu : Non sequential file for checkpoint statefile is a challenge
>> OL : yes, but possible and useful for compression/encryption
>> We showed that there are five steps to realize a checkpoint:
>> 1 - Pre-dump
> I'd just add here that the pre-dump is where you might start writing
> memory to disk, trying to get disk and memory closer and closer to
> being the same until, at some point, you decide they are close enough
> that you can go on to step two, and attempt the freeze+dump+migrate/kill
> with minimal downtime.
> Coming into the discussion my primary concern had been that doing a
> sys_checkpoint() system call would be tough to augment to provide this
> kind of incremental checkpoint, but this breakdown is great for that.
>> 2 - Freeze
>> 3 - Dump
>> 4 - Resume/kill
>> 5 - Post-dump
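To make the five steps concrete, here is a toy userspace simulation, not kernel code. Everything in it is an assumption for illustration: memory is a dict of page number to contents, `dirty` is the set of pages changed since the last copy, `run_one_round` stands in for the task running during pre-dump, and `threshold` is the "close enough" point at which we freeze.

```python
def checkpoint(memory, dirty, run_one_round, threshold=2, max_rounds=10):
    """Toy model of the five checkpoint steps (all names are hypothetical)."""
    image = {}
    log = []

    # 1 - Pre-dump: copy dirty pages while the task keeps running,
    # until the remaining dirty set is small enough (or we give up).
    rounds = 0
    while len(dirty) > threshold and rounds < max_rounds:
        for page in sorted(dirty):
            image[page] = memory[page]
        dirty.clear()
        run_one_round(memory, dirty)  # the task dirties pages again
        rounds += 1
    log.append(("pre-dump", rounds))

    # 2 - Freeze: from here on nothing modifies memory.
    log.append(("freeze", len(dirty)))

    # 3 - Dump: copy whatever is still dirty.
    for page in sorted(dirty):
        image[page] = memory[page]
    dirty.clear()
    log.append(("dump", len(image)))

    # 4 - Resume/kill, 5 - Post-dump: placeholders in this toy model.
    log.append(("resume", None))
    log.append(("post-dump", None))
    return image, log


def writer(memory, dirty):
    """Simulated workload: rewrites pages 0 and 1 on every round."""
    for page in (0, 1):
        memory[page] += 1
        dirty.add(page)
```

With eight pages that all start dirty and the `writer` workload, one pre-dump round is enough: only two pages are copied while frozen, which is the "minimal downtime" argument for the pre-dump step.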
>> At this point we state that we want to create a proof of concept and
>> checkpoint/restart the simplest application.
> By which we mean, start with a piece of step 3 (and maybe a bit of
> step 4).
step 4 is also part of the freezer -- it's the unfreeze operation
(or force a SIGKILL to all processes in the container).
> Step 2 was pretty widely accepted to be the freezer subsystem, but
> no one seemed to be sure quite what the status of that was.
> Matt, can you remind us how the freezer cgroup is doing?
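For steps 2 and 4, a minimal userspace sketch of driving the freezer, assuming the `freezer.state` file interface as documented for the cgroup v1 freezer (write `FROZEN` or `THAWED`). The helper names and the cgroup directory passed in are hypothetical, and the interface may differ from the patches under discussion.

```python
import os

VALID_STATES = ("FROZEN", "THAWED")


def set_freezer_state(cgroup_dir, state):
    """Write the requested state to <cgroup_dir>/freezer.state."""
    if state not in VALID_STATES:
        raise ValueError("state must be one of %r" % (VALID_STATES,))
    with open(os.path.join(cgroup_dir, "freezer.state"), "w") as f:
        f.write(state)


def get_freezer_state(cgroup_dir):
    """Read back the current state; a real freezer may also report FREEZING."""
    with open(os.path.join(cgroup_dir, "freezer.state")) as f:
        return f.read().strip()
```

On a real freezer cgroup the read can return `FREEZING` for a while, so a caller would poll until it sees `FROZEN` before starting the dump.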
>> We will add iteratively more and more kernel resources.
>> Process hierarchy created from kernel or userspace ?
>> OL : Seems better to send a chunk of data to kernel and that restores
>> the processes hierarchy
>> PE : Agreed
>> OL : We should be able to checkpoint from inside the container, keep
>> that in mind for later.
>> => we need a syscall or an ioctl
>> The first items to address before implementing the Checkpoint are:
>> 1 - Make a container object (the context)
>> 2 - Freeze the container (extend cgroup freezer ?)
>> 3 - syscall | ioctl
>> First step:
>> * simplest application : A single process, without any file, no
>> checkpoint of text file (same file system for restart), no signals, no
>> syscall in the application, no ipc/no msgq, no network
>> Second step:
>> * multiple processes + zombie state
>> Third step:
>> * files, pipe, signals, socketpair ?
>> This proof of concept must come with documentation describing what is
>> supported, what is not supported and what we plan to do.
> And there was talk of making sure that if you attempt to checkpoint an
> app using unsupported resources, we return -EAGAIN. There had been
> murmurings about giving more meaningful feedback, but I have no idea
> what that would look like.
yes. some of it is mentioned in the notes that I put in the wiki.
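One possible shape for that, as a sketch only: refuse with -EAGAIN and also hand back the list of resource kinds that blocked the checkpoint. The resource names and the `SUPPORTED` set are hypothetical and only model the "first step" application (no files, signals, ipc or network).

```python
import errno

# Hypothetical: what the first proof of concept would handle.
SUPPORTED = frozenset({"memory", "registers"})


def try_checkpoint(resources):
    """Return (0, []) if all resources are supported,
    else (-EAGAIN, sorted list of the unsupported ones)."""
    unsupported = sorted(set(resources) - SUPPORTED)
    if unsupported:
        return -errno.EAGAIN, unsupported
    return 0, []
```

For example, `try_checkpoint(["memory", "socket", "pipe"])` fails with -EAGAIN and reports `["pipe", "socket"]`, which is at least enough to tell the user what to look at.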
> Containers mailing list
> Containers at lists.linux-foundation.org