[RFC][PATCH 2/2] CR: handle a single task with private memory maps

Wed Aug 6 09:15:46 PDT 2008

On Aug 5, 2008, at 9:23 AM, Dave Hansen wrote:

> On Mon, 2008-08-04 at 20:51 -0700, Joseph Ruscio wrote:
>> It might be desirable for the checkpointing implementation to be
>> modular enough that a userspace application or library could select to
>> handle certain resources on their own. Memory is the primary one that
>> comes to mind.
>
> How would you propose making it modular?
>
> -- Dave
>

Well, it seems to me that the initial focus here is on live migration
of traditional enterprise applications, e.g. databases, app-servers,
etc. I think this is the right focus given how much utility the
general enterprise is finding in capabilities like VMotion. Providing
this mobility to applications without the overhead of traditional VMs
would be very valuable.

On the other hand, I've been primarily focused on checkpointing
large-scale MPI jobs to provide fault tolerance, and that use-case is
somewhat different from the live-migration one. These jobs have huge
RAM footprints (in-core checkpointing is not an option), require
coordination across large numbers of servers, and keep some number of
files open on an enormous parallel filesystem plus some scratch files
open on the local disk/ramdisk. They generally have very simple process
trees with one process per core, or one process with a thread for each
core.

To support these kinds of jobs, one would ideally instruct the
Container checkpointer to ignore network resources, dynamically
allocated private memory, and the contents of open files. You'd be
relying on the Container checkpointer to recreate processes, open file
descriptors, threads, thread synchronization primitives, and IPC
mechanisms (including shm).

As far as the mechanism is concerned, I'd defer to the more
experienced kernel developers here. I assume that passing a bitmask of
flags as an argument into the checkpoint syscall would be frowned
upon, and redundant anyway, as it's unlikely that the mask would
change within a container from checkpoint to checkpoint. If each
container is going to have a CGroup filesystem directory, then we
could have one or more files along the lines of
/proc/sys/kernel/randomize_va_space that turn features off for that
Container. The default settings after Container creation would be a
complete in-kernel checkpoint/migration.
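
To make the per-container file idea a bit more concrete, here is a rough
userspace sketch of how a library might read and flip such settings. It
is only an illustration: the "checkpoint.*" file names, the feature
names, and both helper functions are made up for this mail and do not
correspond to any existing kernel interface. A missing file is treated
as "enabled", which matches the default of a complete in-kernel
checkpoint.

```python
import os
import tempfile

# Hypothetical feature names; no kernel exposes files like these today.
FEATURES = ("private_memory", "network", "file_contents")


def control_path(container_dir, name):
    """Path of the (hypothetical) control file for one feature."""
    if name not in FEATURES:
        raise ValueError("unknown checkpoint feature: " + name)
    return os.path.join(container_dir, "checkpoint." + name)


def read_checkpoint_features(container_dir):
    """Return {feature: enabled}. A missing file means the default, enabled."""
    features = {}
    for name in FEATURES:
        try:
            with open(control_path(container_dir, name)) as f:
                features[name] = f.read().strip() != "0"
        except FileNotFoundError:
            features[name] = True  # default: complete in-kernel checkpoint
    return features


def disable_feature(container_dir, name):
    """Write "0" to the control file so the checkpointer skips this resource."""
    with open(control_path(container_dir, name), "w") as f:
        f.write("0\n")


if __name__ == "__main__":
    # A temporary directory stands in for the container's CGroup directory.
    with tempfile.TemporaryDirectory() as container_dir:
        print(read_checkpoint_features(container_dir))
        # The MPI case: the application handles these three itself.
        for name in FEATURES:
            disable_feature(container_dir, name)
        print(read_checkpoint_features(container_dir))
```

Since the settings live with the container rather than in the syscall
arguments, they would be set once after Container creation and apply to
every later checkpoint.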

thanks,
Joe