[RFC][PATCH 2/2] CR: handle a single task with private memory maps

Louis Rilling Louis.Rilling at kerlabs.com
Thu Aug 7 02:25:28 PDT 2008


On Wed, Aug 06, 2008 at 08:41:10AM -0700, Joseph Ruscio wrote:
>
> On Aug 5, 2008, at 9:20 AM, Oren Laadan wrote:
>> Eh... and, yes, live migration :)
>
> User-space live migration of a "batch" process, e.g. one taking place in
> an MPI job, is quite trivial. User-space live migration of something like
> a database is not that hard, assuming you have a cooperative load
> balancer or proxy on the front end.

Hm, this means modifying the MPI run-time, right? Especially the ones relying on
daemons on each node (like the LAM implementation and, IIRC, the MPI-2
specification). Anyway, this is probably not an issue, since most high-end HPC
systems come with their own customized MPI implementation.

>
> I'm not advocating for implementing this in user-space. I am in complete 
> agreement that this effort should result in code that completely 
> checkpoints a Container in the kernel. My question was whether there are 
> situations where it would be advantageous for user-space to have the 
> option of instructing/hinting the kernel to ignore certain resources that 
> it would handle itself. Most of the use-cases I'm thinking of come from 
> the different styles of implementations I've seen in the HPC space, where 
> our implementation (and a lot of others) are focused.
>
> MPI codes require coordination between all the different processes  
> taking part to ensure that the checkpoints are globally consistent. MPI 
> implementations that run on hardware such as Infiniband would most  
> likely want the container checkpointing to ignore all of the pinned  
> memory associated with the RDMA operations so that the coordination and 
> recreation of MPI communicator state could be handled in user-space. When 
> working with inflexible process checkpointers, MPI coordination routines 
> often must completely tear down all communicator state prior to invoking 
> the checkpoint, and then recreate all the communicators after the 
> checkpoint. On very large scale jobs, this is expensive.
>
> As another example, HPC applications can create local scratch files of
> several GB in /tmp. It may not be necessary to migrate these files, but 
> if user-space has no way to mark a particular file, "local files", or 
> files in general as being ignored, then we'll have to copy these during a 
> migration or a checkpoint.

Definitely agree with you here. This is the kind of use-case we will study in
Kerrighed. (Actually, the project is centered on supporting a petaflop-scale
application, with help from Kerrighed to tolerate failures.)

>
> I don't suppose anyone is attending Linuxworld in San Francisco this  
> week? I'd be more than happy to grab a coffee and talk about some of
> this. I stopped by the OpenVZ booth but none of the devs are around.

Not me, sorry :) However, any requirements you can describe are interesting
to us. They can surely help us design a very useful checkpoint/restart
mechanism.

Thanks,

Louis

-- 
Dr Louis Rilling			Kerlabs
Skype: louis.rilling			Batiment Germanium
Phone: (+33|0) 6 80 89 08 23		80 avenue des Buttes de Coesmes
http://www.kerlabs.com/			35700 Rennes