[BIG RFC] Filesystem-based checkpoint

Dave Hansen dave at linux.vnet.ibm.com
Tue Oct 28 11:37:27 PDT 2008


I hate the syscall.  It's a very un-Linux-y way of doing things.  There,
I said it.  Here's an alternative.  It still uses the syscall to
initiate things, but it uses debugfs to transport the data instead.
This is just a concept demonstration.  It doesn't actually work, and I
wouldn't be using debugfs in practice.

Reading a checkpoint this way takes a lot of system calls, but system
calls in Linux are fast, so doing lots of them is not a problem.  If it
becomes one, we can always export a condensed version of this format
next to the expanded one, kinda like ftrace does.  Atomicity is also
not a problem with this approach: the system call doesn't return until
the checkpoint is completely written out.

This lets userspace pick and choose what parts of the checkpoint it
cares about.  It enables us to do all the I/O from userspace: no
in-kernel sys_read/write().  I think this interface is much more
flexible than a plain syscall.

Want to do a fast checkpoint?  Fine, copy all data, use a lot of memory,
store it in-kernel.  Dump that out when the filesystem is accessed.
Destroy it when userspace asks.

Want to do a checkpoint with a small memory footprint?
10 write one struct
20 wait for userspace
30 goto 10

Userspace can loop like it is reading a pipe.  We could even track
per-checkpoint memory usage in the cr_ctx and stop writing when we go
over a certain memory threshold.
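
The small-footprint loop above can be modelled entirely in userspace.
This is only a rough sketch in Python, not what the patch does: the
one-slot queue stands in for the kernel holding a single struct until
userspace has read it, and the names checkpoint_producer, drain and
run_checkpoint are made up for illustration.

```python
import queue
import threading


def checkpoint_producer(records, channel):
    """Stand-in for the kernel side: emit one record, then wait."""
    for record in records:
        # put() blocks while the single slot is still occupied, i.e.
        # until the consumer has taken the previous record.
        channel.put(record)
    channel.put(None)  # sentinel: the checkpoint is complete


def drain(channel):
    """Stand-in for userspace: loop as if reading a pipe until EOF."""
    out = []
    while True:
        record = channel.get()
        if record is None:
            return out
        out.append(record)


def run_checkpoint(records):
    # maxsize=1 means at most one record is ever buffered.
    channel = queue.Queue(maxsize=1)
    producer = threading.Thread(
        target=checkpoint_producer, args=(records, channel)
    )
    producer.start()
    result = drain(channel)
    producer.join()
    return result
```

Raising maxsize trades memory for fewer stalls, which is roughly the
knob a per-checkpoint memory threshold would give us.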

We can have two modes, internally.  Userspace never has to know which
one we've chosen.  Say we have a word of data to output.  We can either
make a copy at sys_checkpoint() time and let the data continue to be
modified (let the task run), or we can keep the task frozen and
generate the data at debugfs read() time.  The second mode means
potentially zero copying of data until userspace wants it.
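
To make the two modes concrete, here is a toy sketch in Python.  It is
an illustration only: a dict stands in for the task's state, and the
class names EagerValue and LazyValue are invented here, not taken from
the patch.

```python
import copy


class EagerValue:
    """Copy at checkpoint time; the source may keep changing after."""

    def __init__(self, source, key):
        self._snapshot = copy.deepcopy(source[key])

    def read(self):
        return self._snapshot


class LazyValue:
    """Keep a reference and produce the data only at read() time.

    This is only consistent if the source stays frozen until the
    reader is finished.
    """

    def __init__(self, source, key):
        self._source = source
        self._key = key

    def read(self):
        return self._source[self._key]
```

Both expose the same read(), so the reader cannot tell which mode was
picked.  The only visible difference is what happens if the source is
allowed to change before the read, which is why the lazy mode needs the
task kept frozen.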

The same choice applies to structures which might have complicated
locking or lifetime rules.

This also shows how we might handle shared objects.

To use, just call sys_checkpoint() as before, and look in /sys/kernel/debug/.
Use the crid you got back from the syscall to locate your checkpoint.
Write into the 'done' file when you want the sys_checkpoint() to return.

/sys/kernel/debug/checkpoint-1/
/sys/kernel/debug/checkpoint-1/done
/sys/kernel/debug/checkpoint-1/task-1141
/sys/kernel/debug/checkpoint-1/task-1141/fds
/sys/kernel/debug/checkpoint-1/task-1141/fds/1
/sys/kernel/debug/checkpoint-1/task-1141/fds/1/coe
/sys/kernel/debug/checkpoint-1/task-1141/fds/1/fd_nr
/sys/kernel/debug/checkpoint-1/task-1141/fds/1/fd
/sys/kernel/debug/checkpoint-1/task-1141/fds/0
/sys/kernel/debug/checkpoint-1/task-1141/fds/0/coe
/sys/kernel/debug/checkpoint-1/task-1141/fds/0/fd_nr
/sys/kernel/debug/checkpoint-1/task-1141/fds/0/fd
/sys/kernel/debug/checkpoint-1/files
/sys/kernel/debug/checkpoint-1/files/2
/sys/kernel/debug/checkpoint-1/files/2/f_version
/sys/kernel/debug/checkpoint-1/files/2/f_pos
/sys/kernel/debug/checkpoint-1/files/2/f_mode
/sys/kernel/debug/checkpoint-1/files/2/f_flags
/sys/kernel/debug/checkpoint-1/files/1
/sys/kernel/debug/checkpoint-1/files/1/target
/sys/kernel/debug/checkpoint-1/files/1/fd_type
/sys/kernel/debug/checkpoint-1/files/1/f_version
/sys/kernel/debug/checkpoint-1/files/1/f_pos
/sys/kernel/debug/checkpoint-1/files/1/f_mode
/sys/kernel/debug/checkpoint-1/files/1/f_flags
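
A userspace consumer for a tree like the one above could be as small as
this Python sketch.  Several things here are assumptions, since nothing
above pins them down: that each leaf file holds one short text value,
that any write to 'done' is enough to release the syscall (the "1\n" is
arbitrary), and the helper names read_checkpoint and finish_checkpoint.

```python
import os


def read_checkpoint(root):
    """Return the checkpoint tree as nested dicts.

    Directories become dicts and leaf files become stripped strings.
    The 'done' control file is skipped (assumed to be write-only).
    """
    tree = {}
    for name in sorted(os.listdir(root)):
        path = os.path.join(root, name)
        if os.path.isdir(path):
            tree[name] = read_checkpoint(path)
        elif name != "done":
            with open(path) as f:
                tree[name] = f.read().strip()
    return tree


def finish_checkpoint(root):
    """Tell the kernel we are finished so sys_checkpoint() can return."""
    # Assumption: the content written does not matter, only the write.
    with open(os.path.join(root, "done"), "w") as f:
        f.write("1\n")
```

A caller that only cares about open files would read just the
checkpoint's files/ subdirectory and ignore the rest, which is the
pick-and-choose property this layout is meant to give.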

So, why not?

-- Dave
-------------- next part --------------
A non-text attachment was scrubbed...
Name: debugfs-fun0.patch
Type: text/x-patch
Size: 9039 bytes
Desc: not available
Url : http://lists.linux-foundation.org/pipermail/containers/attachments/20081028/fbdb26df/attachment-0001.bin 

