[PATCH 00/10] Checkpoint/restart of open, unlinked files
matthltc at us.ibm.com
Mon Feb 28 20:05:06 PST 2011
This patch set implements the relink file operation and uses it to support
checkpoint and restart of open, unlinked files. During checkpoint,
sys_checkpoint relinks the files and returns. Userspace then checkpoints the
filesystem contents using any backup-like method prior to thawing. That
backup is then made available for use during an optional migration followed
by restore and sys_restart. In the case of network and cluster/distributed
filesystems, copying the filesystem contents explicitly for migration may not
be necessary at all -- it would be part of normal file writes. For
non-migration uses of checkpoint/restart, a filesystem like btrfs could
simply take a snapshot during checkpoint and mount it during restart -- again
without requiring IO proportional to the aggregate size of the filesystem
contents being checkpointed.
These IO savings are critical to the use of checkpoint/restart as a
fault mitigation solution in HPC environments where the probability of
component failure is very high simply due to the number of system
components. Incurring substantial IO for checkpoint/restart interferes
with the IO requirements of HPC jobs and thus reduces the frequency of
checkpoint/restart. That in turn means more processing time is lost
as a consequence of a fault. The longer period between checkpoints
and the IO required to re-establish hardlinks are simply not acceptable
for these environments.
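
The discussion that follows refers to a path "a", a second path "b", and a
descriptor a_fd. A minimal sketch of the situation it appears to describe,
assuming "a" is opened, hardlinked as "b", and then unlinked while a_fd stays
open (the helper name write_after_unlink and the temporary directory are
illustrative only, not part of the patch set):

```python
import os
import tempfile


def write_after_unlink(workdir):
    """Create "a", hardlink it as "b", unlink "a", then write through a_fd.

    Returns (same_inode, contents_of_b).
    """
    a = os.path.join(workdir, "a")
    b = os.path.join(workdir, "b")
    a_fd = os.open(a, os.O_CREAT | os.O_RDWR, 0o600)
    try:
        os.link(a, b)             # "b" now names the same inode as "a"
        os.unlink(a)              # the path "a" is gone; a_fd is still open
        os.write(a_fd, b"hello")  # the write goes to the shared inode
        same_inode = os.fstat(a_fd).st_ino == os.stat(b).st_ino
    finally:
        os.close(a_fd)
    with open(b, "rb") as f:
        return same_inode, f.read()


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        print(write_after_unlink(d))
```

On a filesystem with hardlink support the data written through a_fd is
visible through "b", which is the behaviour a restart has to preserve.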
Without relinking we would need to walk the entire filesystem to find out
that "b" is a path to the same inode (another variation on this case: "b"
would also have been unlinked). We'd need to do this for every
unlinked file that remains open in every task to checkpoint. Even then
there is no guarantee such a "b" exists for every unlinked file -- the
inodes could be "orphans" -- and we'd need to preserve their contents
some other way.
I considered a couple of alternatives for preserving unlinked file contents:
copying and file handles. Each has significant drawbacks.
First, I attempted to copy the file contents into the image and then
recreate and unlink the file during restart. With a simple version of
that method, the write above would not reach "b". One fix would be to search
the filesystem for a file with the same inode number (the inode of "b") and
either open it or hardlink it to "a". Another would be to record the inode
number. These fixes either shift the search from checkpoint time to restart
time or have all the drawbacks of the second method I considered: file handles.
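
To make the cost of the search concrete, here is a rough userspace sketch of
looking for surviving names of an open file by walking a directory tree and
comparing device and inode numbers. The function name find_paths_for_fd and
the choice of root are assumptions for illustration. A real search would have
to cover the whole filesystem the file lives on.

```python
import os
import tempfile


def find_paths_for_fd(fd, root):
    """Walk root and return every path that names the same inode as fd."""
    target = os.fstat(fd)
    matches = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                st = os.lstat(path)
            except OSError:
                continue  # entry vanished or is unreadable; skip it
            if (st.st_dev, st.st_ino) == (target.st_dev, target.st_ino):
                matches.append(path)
    return matches


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as root:
        a = os.path.join(root, "a")
        os.mkdir(os.path.join(root, "sub"))
        b = os.path.join(root, "sub", "b")
        a_fd = os.open(a, os.O_CREAT | os.O_RDWR, 0o600)
        os.link(a, b)
        os.unlink(a)
        print(find_paths_for_fd(a_fd, root))  # the surviving name, if any
        os.unlink(b)
        print(find_paths_for_fd(a_fd, root))  # nothing left: an orphan inode
        os.close(a_fd)
```

The walk has to stat every entry under root for each open, unlinked file,
and it returns nothing at all once the last name is gone.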
Instead of copying contents or recording inodes, I also considered using
file handles. We'd need to ensure that the file handles persist in storage,
can be snapshotted/backed up, and can be migrated. Can handlefs or any
generic file handle system do this? My _guess_ is "no" but folks are
welcome to tell me I'm wrong.
In contrast, linking the file from a_fd back into its filesystem can avoid
these complexities. Relinking avoids the search for matching inodes and
copying large quantities of data from storage only to write it back (in
fact a non-linking solution requires that the data be read-and-written
twice -- once for checkpoint and once for restart). Like file handles, it does
require changes to the filesystem code. Unlike file handles, enabling
relinking does not require every filesystem to support a new kind of
filesystem "object" -- only an operation that is quite similar to one that
already exists: link.
[PATCH 01/10] Create the .relink file_operation
[PATCH 02/10] ext3/4: Allow relinking to unlinked files
[PATCH 03/10] Split do_linkat() out of sys_linkat
[PATCH 04/10] Checkpoint/restart unlinked files
[PATCH 05/10] Enable c/r of unlinked fifos
[PATCH 06/10] Support relinking unlinked files in btrfs
[PATCH 07/10] Add relink_dir superblock field
[PATCH 08/10] Parse the relink=%s mount option
[PATCH 09/10] Enabling checkpoint relink of unlinked files inside containers
[PATCH 10/10] [RFC] Use call_usermodehelper to cleanup after failure
There's a memory leak (Reported-by: "Jose R. Santos"
<jrs at linux.vnet.ibm.com>) that I haven't tracked down completely yet.
It seems to be in the "relink=" mount option parsing code -- I feel like I
must be missing some code path related to vfsmount handling.