[Ksummit-2010-discuss] checkpoint-restart: naked patch
orenl at cs.columbia.edu
Sat Nov 20 10:05:15 PST 2010
Based on discussion with Gene, I'd like to clarify the key points and
differences between the kernel and userspace approaches (specifically
linux-cr and dmtcp). Three parts, to break up the long post...
part I: perspective on the types and scopes of c/r under discussion
part II: linux-cr design and objectives
part III: comparison kernel/userspace approaches
[now relax, grab (another) cup of coffee and read on...]
PART I: ==PERSPECTIVE==
A rough classification of c/r categories:
* container-c/r: important use-case, e.g. c/r and migration of
application containers like a VPS (virtual private server), VDI
(desktop) or another self-contained application (e.g. Oracle server).
Here _all_ the relevant processes are included in the checkpoint.
* standalone-c/r: another use-case is standalone-c/r where a set of
processes is checkpointed, but not the entire environment, and then
those processes are restarted in a different "eco-system".
* distributed-c/r: meaning several sets of processes, each running
on a different host. (Each set may be a separate container there).
In container-c/r, the main challenge is to be _reliable_ in the sense
that a restart from a successful checkpoint should always succeed.
In standalone-c/r, the main challenge is that an application resumes
execution after a restart in a possibly _different_ eco-system. Some
applications don't care (e.g. 'bc'). Other applications do care, and to
different degrees; for these we need "glue" to pacify the application.
There are generally three types of "glue":
(1) Modify the application or selected libraries to be c/r-aware, and
notify it when restart completes. (e.g. CoCheck MPI library).
(2) Add a userspace helper that will run post-restart to do necessary
trickery (e.g. send a SIGWINCH to 'screen'; mount the proper filesystem
at the new host after migration; reconnect a socket to a peer).
(3) Use interposition on selected library calls and add wrapper code
that will glue in what's missing (e.g. dbus or nscd calls to
reconnect an application to those services).
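To make glue type (2) concrete, here is a minimal sketch of a post-restart
helper that sends SIGWINCH to a restarted process, as in the 'screen'
example above. The function name post_restart_glue and the way the pid is
obtained are assumptions made purely for illustration; this is not part of
linux-cr or dmtcp, and a real helper would be driven by whatever restart
tooling is in use.

```python
import os
import signal


def post_restart_glue(pid, sig=signal.SIGWINCH):
    """Hypothetical post-restart helper (illustration only).

    Sends `sig` to the restarted process `pid` so that it re-reads
    state that may have changed across the restart (e.g. the terminal
    size, in the case of SIGWINCH).

    Returns True if the signal was sent, False if no such process exists.
    """
    try:
        os.kill(pid, sig)
    except ProcessLookupError:
        # The process is gone; there is nothing to glue.
        return False
    return True


if __name__ == "__main__":
    import sys

    # Assumed usage: the restart tooling passes the pid of the restarted
    # application as the first command-line argument.
    if len(sys.argv) == 2:
        ok = post_restart_glue(int(sys.argv[1]))
        sys.exit(0 if ok else 1)
```

The helper runs entirely outside the checkpointed application, which is
what keeps this kind of glue independent of how the c/r itself is done.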
IMPORTANT: the glueing method is _orthogonal_ to how the c/r is done !
We are strictly discussing the core c/r functionality.
(next part: linux-cr philosophy...)