[PATCH 00/10] Checkpoint/restart of open, unlinked files

Mon Feb 28 21:05:20 PST 2011

Argh, always seem to forget some important details!
This series is based on Oren's ckpt-v23-rc1 tree plus:

1. sys-wrappers
	This patch set from Namhyung Kim wraps "various syscalls
	that were used in init code." These wrappers are also useful
	for the c/r patchset.

2. setns
	Because we're often checkpointing files in a mount namespace
	other than the one occupied by the task calling sys_checkpoint(),
	relink fails with EXDEV _unless_ we change to the appropriate
	mount namespace first. This dependency affects patch 9 of the
	series.
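
For readers unfamiliar with the EXDEV issue, here is a rough userspace
analogue, assuming a kernel that exposes /proc/<pid>/ns/mnt and a caller
with the privileges setns(2) requires for mount namespaces. The helper
names mnt_ns_path() and link_in_mnt_ns_of() are made up for illustration;
the series itself switches namespaces inside the kernel, not like this.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <sched.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: build the procfs path of a task's mount namespace. */
int mnt_ns_path(pid_t pid, char *buf, size_t len)
{
    int n = snprintf(buf, len, "/proc/%ld/ns/mnt", (long)pid);

    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/*
 * Hypothetical helper: join the mount namespace of `pid`, then create the
 * hard link there. Both paths are resolved against the mounts of the
 * namespace just joined, which is the point of switching first.
 * Returns 0 on success, -1 on failure (errno is set by the failing call).
 */
int link_in_mnt_ns_of(pid_t pid, const char *oldpath, const char *newpath)
{
    char ns[64];
    int fd, rc;

    if (mnt_ns_path(pid, ns, sizeof(ns)) < 0)
        return -1;

    fd = open(ns, O_RDONLY | O_CLOEXEC);
    if (fd < 0)
        return -1;

    /* Needs a single-threaded caller with sufficient capabilities. */
    rc = setns(fd, CLONE_NEWNS);
    close(fd);
    if (rc < 0)
        return -1;

    return link(oldpath, newpath);
}
```

Something like link_in_mnt_ns_of(container_pid, "/old", "/new") would be
the call site; absolute paths are the safer choice since the caller's view
of the mount tree changes once setns() succeeds.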

Cheers,
	-Matt Helsley

On Mon, Feb 28, 2011 at 08:05:06PM -0800, Matt Helsley wrote:
> This patch set implements the relink file operation and uses it to support
> checkpoint and restart of open, unlinked files. During checkpoint,
> sys_checkpoint relinks the files and returns. Userspace then checkpoints the
> filesystem contents using any backup-like method prior to thawing. That
> backup is then made available for use during an optional migration followed
> by restore and sys_restart. In the case of network and cluster/distributed
> filesystems, copying the filesystem contents explicitly for migration may
> not be necessary at all -- it would be part of normal file writes. For
> non-migration uses of checkpoint/restart on filesystems like btrfs, a
> snapshot could simply be taken during checkpoint and mounted during
> restart -- again without requiring IO proportional to the aggregate size
> of the filesystem contents being checkpointed.
> 
> These IO savings are critical to the use of checkpoint/restart as a
> fault mitigation solution in HPC environments where the probability of
> component failure is very high simply due to the number of system
> components. Incurring substantial IO for checkpoint/restart interferes
> with the IO requirements of HPC jobs and thus reduces the frequency of
> checkpoint/restart. That in turn means more processing time is lost
> as a consequence of a fault -- the longer period between checkpoints
> plus the IO required to re-establish hardlinks are simply not acceptable
> for these environments.
> 
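> Consider a task that holds a_fd open on a file "a" which has a second
> hardlink "b", where "a" has since been unlinked; a write through a_fd must
> still reach "b". Without relinking we would need to walk the entire
> filesystem to find out that "b" is a path to the same inode (another
> variation on this case: "b" would also have been unlinked). We'd need to
> do this for every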
> unlinked file that remains open in every task to checkpoint. Even then
> there is no guarantee such a "b" exists for every unlinked file -- the
> inodes could be "orphans" -- and we'd need to preserve their contents
> some other way.
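
To make the "a"/"b" case concrete, here is a minimal userspace sketch of
it. The helper name and the use of a scratch directory are assumptions for
illustration only; nothing here comes from the patches themselves.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Hypothetical helper: inside scratch directory `dir`, create "a", hardlink
 * it to "b", unlink "a", then write through the still-open a_fd.
 * On success, `out` holds what a reader of "b" sees and the return value is
 * the link count of the inode behind a_fd. Returns -1 on any failure.
 */
int unlinked_write_reaches_b(const char *dir, char *out, size_t outlen)
{
    char a[4096], b[4096];
    struct stat st;
    ssize_t n = -1;
    int a_fd, b_fd, ret = -1;

    if (outlen == 0)
        return -1;
    snprintf(a, sizeof(a), "%s/a", dir);
    snprintf(b, sizeof(b), "%s/b", dir);

    a_fd = open(a, O_CREAT | O_EXCL | O_RDWR, 0600);
    if (a_fd < 0)
        return -1;
    if (link(a, b) < 0) {
        unlink(a);
        close(a_fd);
        return -1;
    }
    if (unlink(a) < 0)
        goto out;

    /* "a" is gone from the namespace, but a_fd still names the inode. */
    if (write(a_fd, "hello", 5) != 5 || fstat(a_fd, &st) < 0)
        goto out;

    /* The same inode is reachable through "b", so the write shows up. */
    b_fd = open(b, O_RDONLY);
    if (b_fd < 0)
        goto out;
    n = read(b_fd, out, outlen - 1);
    close(b_fd);
    if (n < 0)
        goto out;
    out[n] = '\0';
    ret = (int)st.st_nlink;
out:
    unlink(b);
    close(a_fd);
    return ret;
}
```

A checkpoint that only copied the bytes visible through a_fd and recreated
"a" at restart would end up with a fresh inode, so later writes would no
longer be seen through "b".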
> 
> I considered a couple of alternatives to relinking for preserving unlinked file contents:
> copying and file handles. Each has significant drawbacks.
> 
> First I attempted to copy the file contents into the image and then
> recreate and unlink the file during restart. Using a simple version of
> that method the write above would not reach "b". One fix would be to search
> the filesystem for a file with the same inode number (inode of "b") and
> either open it or hardlink it to "a". Another would be to record the inode
> number. This either shifts the search from checkpoint time to restart time
> or has all the drawbacks of the second method I considered: file handles.
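
The inode search mentioned above could look roughly like the following
userspace sketch, which walks a tree with nftw(3) and compares
(st_dev, st_ino) against the open descriptor. find_path_for_fd() is a
made-up name, and the globals exist only because nftw() callbacks take no
user pointer, so this is not thread-safe. It is meant to show why the cost
scales with the size of the filesystem, not to suggest an implementation.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <ftw.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

/* Search state shared with the nftw() callback. */
static dev_t want_dev;
static ino_t want_ino;
static char found[4096];

static int visit(const char *path, const struct stat *st, int type,
                 struct FTW *ftw)
{
    (void)ftw;
    if (type == FTW_F && st->st_dev == want_dev && st->st_ino == want_ino) {
        snprintf(found, sizeof(found), "%s", path);
        return 1;   /* a non-zero return stops the walk */
    }
    return 0;
}

/*
 * Hypothetical helper: look under `root` for any remaining path to the
 * inode behind `fd`. Returns 1 and fills `out` if one is found, 0 if the
 * whole tree was walked without a match (an "orphan"), -1 on error.
 */
int find_path_for_fd(int fd, const char *root, char *out, size_t outlen)
{
    struct stat st;
    int rc;

    if (fstat(fd, &st) < 0)
        return -1;
    want_dev = st.st_dev;
    want_ino = st.st_ino;
    found[0] = '\0';

    /* FTW_MOUNT: hard links cannot cross filesystems, so stay on one. */
    rc = nftw(root, visit, 16, FTW_PHYS | FTW_MOUNT);
    if (rc == 1) {
        snprintf(out, outlen, "%s", found);
        return 1;
    }
    return rc < 0 ? -1 : 0;
}
```

In the worst case (no surviving link) every file under `root` is visited
before the answer comes back, and that has to be repeated for each open,
unlinked file.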
> 
> Instead of copying contents or recording inodes I also considered using
> file handles. We'd need to ensure that the file handles persist in storage,
> can be snapshotted/backed up, and can be migrated. Can handlefs or any
> generic file handle system do this? My _guess_ is "no" but folks are
> welcome to tell me I'm wrong.
> 
> In contrast, linking the file from a_fd back into its filesystem can avoid
> these complexities. Relinking avoids the search for matching inodes and
> copying large quantities of data from storage only to write it back (in
> fact a non-linking solution requires that the data be read-and-written
> twice -- once for checkpoint and once for restart). Like file handles it does
> require changes to the filesystem code. Unlike file handles, enabling
> relinking does not require every filesystem to support a new kind of
> filesystem "object" -- only an operation that is quite similar to one that
> already exists: link.
> 
> [PATCH 01/10] Create the .relink file_operation
> [PATCH 02/10] ext3/4: Allow relinking to unlinked files
> [PATCH 03/10] Split do_linkat() out of sys_linkat
> [PATCH 04/10] Checkpoint/restart unlinked files
> [PATCH 05/10] Enable c/r of unlinked fifos
> [PATCH 06/10] Support relinking unlinked files in btrfs
> [PATCH 07/10] Add relink_dir superblock field
> [PATCH 08/10] Parse the relink=%s mount option
> [PATCH 09/10] Enabling checkpoint relink of unlinked files inside containers
> [PATCH 10/10] [RFC] Use call_usermodehelper to cleanup after failure
> 
> BUGS:
> 
> 	There's a memory leak (Reported-by: "Jose R. Santos"
> <jrs at linux.vnet.ibm.com>) that I haven't tracked down completely yet.
> It seems to be in the "relink=" mount option parsing code -- I feel like I
> must be missing some code path related to vfsmount handling.
> 
> Cheers,
> 	-Matt Helsley
> _______________________________________________
> Containers mailing list
> Containers at lists.linux-foundation.org
> https://lists.linux-foundation.org/mailman/listinfo/containers