[PATCH 00/80] Kernel based checkpoint/restart [v18]

Rishikesh risrajak at linux.vnet.ibm.com
Thu Sep 24 06:05:58 PDT 2009


Hi Oren,

I am getting the following build error while compiling the linux-cr kernel:

git://git.ncl.cs.columbia.edu/pub/git/linux-cr.git

...
net/unix/af_unix.c:528: error: ‘unix_collect’ undeclared here (not in a function)
LD [M] drivers/net/enic/enic.o
make[2]: *** [net/unix/af_unix.o] Error 1
make[1]: *** [net/unix] Error 2
make: *** [net] Error 2
make: *** Waiting for unfinished jobs....
...

Let me know if you need my config file.

-Rishi
Oren Laadan wrote:
> Hi Andrew,
>
> This is our recent round of checkpoint/restart patches. It can
> checkpoint and restart interactive sessions of 'screen' across 
> kernel reboot. Please consider applying to -mm.
>
> Patches 1-17 are clean-ups and preparations for c/r:
>  * 1,2,3,4 and 9,10: cleanups, also useful for c/r.
>  * 5,6: fix freezer control group
>  * 7,8: extend freezer control group for c/r.
>  * 11-17: clone_with_pid
>
> Patch 18 reserves the system call slots - please apply so we
> don't need to keep changing them.
>
> Patches 19-80 contain the actual c/r code; we've exhausted the
> reviewers for most of them.
>
> Patch 32 implements a deferqueue: a mechanism for a process to
> defer work to some later time (unlike a workqueue, it is designed
> for the work to execute in the context of the same/original process).
>
> Thanks,
>
> Oren.
>
> ----
>
> Application checkpoint/restart (c/r) is the ability to save the state
> of a running application so that it can later resume its execution
> from the time at which it was checkpointed, on the same or a different
> machine.
>
> This version brings many new features, including support for
> unix domain sockets, fifos, pseudo-terminals, and signals (see the
> detailed changelog below).
>
> With these in place, it can now checkpoint and restart not only batch
> jobs, but also interactive programs using 'screen'. For example, users
> can checkpoint a 'screen' session with multiple shells, upgrade their
> kernel, reboot, and restart their interactive 'screen' session from
> before!
>
> This patchset was compiled and tested against v2.6.31. For more
> information, check out Documentation/checkpoint/*.txt
>
> Q: How useful is this code as it stands in real-world usage?
> A: The application may consist of one or multiple processes and
>    threads. It handles open files (regular files/directories on most
>    file systems, pipes, fifos, af_unix sockets,
>    /dev/{null,zero,random,urandom} and pseudo-terminals). It supports
>    shared memory and sysv IPC (except undo of semaphores). It's
>    suitable for many types of batch jobs as well as some interactive
>    jobs. (Note: it is assumed that the fs view is available at restart.)
>
> Q: What can it checkpoint and restart?
> A: A (single threaded) process can checkpoint itself, aka "self"
>    checkpoint, if it calls the new system calls. Otherwise, for an
>    "external" checkpoint, the caller must first freeze the target
>    processes. One can either checkpoint an entire container (and
>    we make best effort to ensure that the result is self-contained),
>    or merely a subtree of a process hierarchy.
>
> Q: What about namespaces?
> A: Currently, UTS and IPC namespaces are restored. They demonstrate
>    how namespaces are handled. More to come.
>
> Q: What additional work needs to be done to it?
> A: Fill in the gory details following the examples so far. Current WIP
>    includes inet sockets, event-poll, and early work on inotify, mount
>    namespace and mount-points, pseudo file systems, and x86_64 support.
>    
> Q: How can I try it?
> A: Use it for simple batch jobs (pipes, too), or an interactive
>    'screen' session, in a whole container or just a subtree of
>    tasks:
>
>    create the freezer cgroup:
>      $ mount -t cgroup -ofreezer freezer /cgroup
>      $ mkdir /cgroup/0
>    
>    run the test, freeze it:  
>      $ test/multitask &
>      [1] 2754
>      $ for i in `pidof multitask`; do echo $i > /cgroup/0/tasks; done
>      $ echo FROZEN > /cgroup/0/freezer.state
>    
>    checkpoint:
>      $ ./ckpt 2754 > ckpt.out
>    
>    restart:
>      $ ./mktree < ckpt.out
>    
>    voila :)
>    
> To do all this, you'll need:
>
> The git tree tracking v18, branch 'ckpt-v18' (and past versions):
> 	git://git.ncl.cs.columbia.edu/pub/git/linux-cr.git
>
> The userspace tools are available through the matching branch [v18]:
> 	git://git.ncl.cs.columbia.edu/pub/git/user-cr.git
>
>
> Changelog:
>
> [2009-Sep-22] v18
>
>   (new features)
>   - [Nathan Lynch] Re-introduce powerpc support
>   - Save/restore pseudo-terminals
>   - Save/restore (pty) controlling terminals
>   - Save/restore PGIDs
>   - [Dan Smith] Save/restore unix domain sockets
>   - Save/restore FIFOs
>   - Save/restore pending signals
>   - Save/restore rlimits
>   - Save/restore itimers
>   - [Matt Helsley] Handle many non-pseudo file-systems
>
>   (other changes)
>   - Rename headerless struct ckpt_hdr_* to struct ckpt_*
>   - [Nathan Lynch] discard const from struct cred * where appropriate
>   - [Serge Hallyn][s390] Set return value for self-checkpoint 
>   - Handle kmalloc failure in restore_sem_array()
>   - [IPC] Collect files used by shm objects
>   - [IPC] Use file (not inode) as shared object on checkpoint of shm
>   - More ckpt_write_err()s to give information on checkpoint failure
>   - Adjust format of pipe buffer to include the mandatory pre-header
>   - [LEAKS] Mark the backing file as visited at checkpoint
>   - Tighten checks on supported vma to checkpoint or restart
>   - [Serge Hallyn] Export filemap_checkpoint() (used for ext4)
>   - Introduce ckpt_collect_file() that also uses file->collect method
>   - Use ckpt_collect_file() instead of ckpt_obj_collect() for files
>   - Fix leak-detection issue in collect_mm() (test for first-time obj)
>   - Invoke set_close_on_exec() unconditionally on restart
>   - [Dan Smith] Export fill_fname() as ckpt_fill_fname()
>   - Interface to pass simple pointers as data with deferqueue
>   - [Dan Smith] Fix ckpt_obj_lookup_add() leak detection logic
>   - Replace EAGAIN with EBUSY where necessary
>   - Introduce CKPT_OBJ_VISITED in leak detection
>   - ckpt_obj_collect() returns objref for new objects, 0 otherwise
>   - Rename ckpt_obj_checkpointed() to ckpt_obj_visited()
>   - Introduce ckpt_obj_visit() to mark objects as visited
>   - Set the CHECKPOINTED flag on objects before calling checkpoint
>   - Introduce ckpt_obj_reserve()
>   - Change ref_drop() to accept a @lastref argument (for cleanup)
>   - Disallow multiple objects with same objref in restart
>   - Allow _ckpt_read_obj_type() to read header only (w/o payload)
>   - Fix leak of ckpt_ctx when restoring zombie tasks
>   - Fix race of prepare_descendant() with an ongoing fork()
>   - Track and report the first error if restart fails
>   - Tighten logic to protect against bogus pids in input
>   - [Matt Helsley] Improve debug output from ckpt_notify_error()
>   - [Nathan Lynch] fix compilation errors with CONFIG_COMPAT=y
>   - Detect error-headers in input data on restart, and abort.
>   - Standard format for checkpoint error strings (and documentation)
>   - [Dan Smith] Add an errno validation function
>   - Add ckpt_read_payload(): read a variable-length object (no header)
>   - Add ckpt_read_string(): same for strings (ensures null-terminated)
>   - Add ckpt_read_consume(): consumes next object without processing
>   - [John Dykstra] Fix no-dot-config-targets pattern in linux/Makefile
>
> [2009-Jul-21] v17
>   - Introduce syscall clone_with_pids() to restore original pids
>   - Support threads and zombies
>   - Save/restore task->files
>   - Save/restore task->sighand
>   - Save/restore futex
>   - Save/restore credentials
>   - Introduce PF_RESTARTING to skip notifications on task exit
>   - restart(2) allow caller to ask to freeze tasks after restart
>   - restart(2) isn't idempotent: return -EINTR if interrupted
>   - Improve debugging output handling 
>   - Make multi-process restart logic more robust and complete
>   - Correctly select return value for restarting tasks on success
>   - Tighten ptrace test for checkpoint to PTRACE_MODE_ATTACH
>   - Use CHECKPOINTING state for frozen checkpointed tasks
>   - Fix compilation without CONFIG_CHECKPOINT
>   - Fix compilation with CONFIG_COMPAT
>   - Fix headers includes and exports
>   - Leak detection performed in two steps
>   - Detect "inverse" leaks of objects (dis)appearing unexpectedly
>   - Memory: save/restore mm->{flags,def_flags,saved_auxv}
>   - Memory: only collect sub-objects of mm once (leak detection)
>   - Files: validate f_mode after restore
>   - Namespaces: leak detection for nsproxy sub-components
>   - Namespaces: proper restart from namespace(s) without namespace(s)
>   - Save global constants in header instead of per-object
>   - IPC: replace sys_unshare() with create_ipc_ns()
>   - IPC: restore objects in suitable namespace
>   - IPC: correct behavior under !CONFIG_IPC_NS
>   - UTS: save/restore all fields
>   - UTS: replace sys_unshare() with create_uts_ns()
>   - X86_32: sanitize cpu, debug, and segment registers on restart
>   - cgroup_freezer: add CHECKPOINTING state to safeguard checkpoint
>   - cgroup_freezer: add interface to freeze a cgroup (given a task)
>
> [2009-May-27] v16
>   - Privilege checks for IPC checkpoint
>   - Fix error string generation during checkpoint
>   - Use kzalloc for header allocation
>   - Restart blocks are arch-independent
>   - Redo pipe c/r using splice
>   - Fixes to s390 arch
>   - Remove powerpc arch (temporary)
>   - Explicitly restore ->nsproxy
>   - All objects in image are preceded by 'struct ckpt_hdr'
>   - Fix leaks detection (and leaks)
>   - Reorder of patchset
>   - Misc bugs and compilation fixes
>
> [2009-Apr-12] v15
>   - Minor fixes
>
> [2009-Apr-28] v14
>   - Tested against kernel v2.6.30-rc3 on x86_32.
>   - Refactor files checkpoint to use f_ops (file operations)
>   - Refactor mm/vma to use vma_ops
>   - Explicitly handle VDSO vma (and require compat mode)
>   - Added code to c/r restart-blocks (restart timeout related syscalls)
>   - Added code to c/r namespaces: uts, ipc (with Dan Smith)
>   - Added code to c/r sysvipc (shm, msg, sem)
>   - Support for VM_CLONE shared memory
>   - Added resource leak detection for whole-container checkpoint
>   - Added sysctl gauge to allow unprivileged restart/checkpoint
>   - Improve and simplify the code and logic of shared objects
>   - Rework image format: shared objects appear prior to their use
>   - Merge checkpoint and restart functionality into same files
>   - Massive renaming of functions: prefix "ckpt_" for generics,
>     "checkpoint_" for checkpoint, and "restore_" for restart.
>   - Report checkpoint errors as a valid (string record) in the output
>   - Merged PPC architecture (by Nathan Lynch)
>   - Requires updates to userspace tools too.
>   - Misc nits and bug fixes
>
> [2009-Mar-31] v14-rc2
>   - Change along Dave's suggestion to use f_ops->checkpoint() for files
>   - Merge patch simplifying Kconfig, with CONFIG_CHECKPOINT_SUPPORT
>   - Merge support for PPC arch (Nathan Lynch)
>   - Misc cleanups and fixes in response to comments
>
> [2009-Mar-20] v14-rc1:
>   - The 'h.parent' field of 'struct cr_hdr' isn't used - discard
>   - Check whether calls to cr_hbuf_get() succeed or fail.
>   - Fixes to pipe c/r code
>   - Prevent deadlock by refusing c/r when a pipe inode == ctx->file inode
>   - Refuse non-self checkpoint if a task isn't frozen
>   - Use unsigned fields in checkpoint headers unless otherwise required
>   - Rename functions in files c/r to better reflect their role
>   - Add support for anonymous shared memory
>   - Merge support for s390 arch (Dan Smith, Serge Hallyn)
>     
> [2008-Dec-03] v13:
>   - Cleanups of 'struct cr_ctx' - remove unused fields
>   - Misc fixes for comments
>   
> [2008-Dec-17] v12:
>   - Fix re-alloc/reset of pgarr chain to correctly reuse buffers
>     (empty pgarr are saved in a separate pool chain)
>   - Add a couple of missed calls to cr_hbuf_put()
>   - cr_kwrite/cr_kread() again use vfs_read(), vfs_write() (safer)
>   - Split cr_write/cr_read() to two parts: _cr_write/read() helper
>   - Befriend with sparse: explicit conversion to 'void __user *'
>   - Redefine 'pr_fmt' and replace cr_debug() with pr_debug()
>
> [2008-Dec-05] v11:
>   - Use contents of 'init->fs->root' instead of pointing to it
>   - Ignore symlinks (there is no such thing as an open symlink)
>   - cr_scan_fds() retries from scratch if it hits size limits
>   - Add missing test for VM_MAYSHARE when dumping memory
>   - Improve documentation about: behavior when tasks aren't frozen,
>     life span of the object hash, references to objects in the hash
>
> [2008-Nov-26] v10:
>   - Grab vfs root of container init, rather than current process
>   - Acquire dcache_lock around call to __d_path() in cr_fill_name()
>   - Force end-of-string in cr_read_string() (fix possible DoS)
>   - Introduce cr_write_buffer(), cr_read_buffer() and cr_read_buf_type()
>
> [2008-Nov-10] v9:
>   - Support multiple processes c/r
>   - Extend checkpoint header with architecture-dependent header
>   - Misc bug fixes (see individual changelogs)
>   - Rebase to v2.6.28-rc3.
>
> [2008-Oct-29] v8:
>   - Support "external" checkpoint
>   - Include Dave Hansen's 'deny-checkpoint' patch
>   - Split docs in Documentation/checkpoint/..., and improve contents
>
> [2008-Oct-17] v7:
>   - Fix save/restore state of FPU
>   - Fix argument given to kunmap_atomic() in memory dump/restore
>
> [2008-Oct-07] v6:
>   - Balance all calls to cr_hbuf_get() with matching cr_hbuf_put()
>     (even though it's not really needed)
>   - Add assumptions and what's-missing to documentation
>   - Misc fixes and cleanups
>
> [2008-Sep-11] v5:
>   - Config is now 'def_bool n' by default
>   - Improve memory dump/restore code (following Dave Hansen's comments)
>   - Change dump format (and code) to allow chunks of <vaddrs, pages>
>     instead of one long list of each
>   - Fix use of follow_page() to avoid faulting in non-present pages
>   - Memory restore now maps user pages explicitly to copy data into them,
>     instead of reading directly to user space; got rid of mprotect_fixup()
>   - Remove preempt_disable() when restoring debug registers
>   - Rename headers files s/ckpt/checkpoint/
>   - Fix misc bugs in files dump/restore
>   - Fixes and cleanups on some error paths
>   - Fix misc coding style
>
> [2008-Sep-09] v4:
>   - Various fixes and clean-ups
>   - Fix calculation of hash table size
>   - Fix header structure alignment
>   - Use standard list_... for cr_pgarr
>
> [2008-Aug-29] v3:
>   - Various fixes and clean-ups
>   - Use standard hlist_... for hash table
>   - Better use of standard kmalloc/kfree
>
> [2008-Aug-20] v2:
>   - Added Dump and restore of open files (regular and directories)
>   - Added basic handling of shared objects, and improve handling of
>     'parent tag' concept
>   - Added documentation
>   - Improved ABI, 64bit padding for image data
>   - Improved locking when saving/restoring memory
>   - Added UTS information to header (release, version, machine)
>   - Cleanup extraction of filename from a file pointer
>   - Refactor to allow easier reviewing
>   - Remove requirement for CAP_SYS_ADMIN until we come up with a
>     security policy (this means that file restore may fail)
>   - Other cleanup and response to comments for v1
>
> [2008-Jul-29] v1:
>   - Initial version: support a single task with address space of only
>     private anonymous or file-mapped VMAs; syscalls ignore pid/crid
>     argument and act on current process.
>
> --
> At the containers mini-conference before OLS, the consensus among
> all the stakeholders was that doing checkpoint/restart in the kernel
> as much as possible was the best approach.  With this approach, the
> kernel will export a relatively opaque 'blob' of data to userspace
> which can then be handed to the new kernel at restore time.
>
> This is different than what had been proposed before, which was
> that a userspace application would be responsible for collecting
> all of this data.  We were also planning on adding lots of new,
> little kernel interfaces for all of the things that needed
> checkpointing.  This unites those into a single, grand interface.
>
> The 'blob' will contain copies of select portions of kernel
> structures such as vmas and mm_structs.  It will also contain
> copies of the actual memory that the process uses.  Any changes
> in this blob's format between kernel revisions can be handled by
> an in-userspace conversion program.
>
> This is a similar approach to virtually all of the commercial
> checkpoint/restart products out there, as well as the research
> project Zap.
>
> These patches basically serialize internal kernel state and write
> it out to a file descriptor.  The checkpoint and restore are done
> with two new system calls: sys_checkpoint and sys_restart.
>
> In this incarnation, they can only checkpoint and restore a
> single task. The task's address space may consist of only private,
> simple vma's - anonymous or file-mapped. The open files may consist
> of only simple files and directories.
> --
> _______________________________________________
> Containers mailing list
> Containers at lists.linux-foundation.org
> https://lists.linux-foundation.org/mailman/listinfo/containers
>   


