[BIG RFC] Filesystem-based checkpoint

Thu Oct 30 12:28:17 PDT 2008

Quoting Oren Laadan (orenl at cs.columbia.edu):
> 
> I'm not sure why you say it's "un-linux-y" to begin with. But to the

The thing that is un-linux-y is specifically having user-space pass an
fd to the kernel, which the kernel then reads from or writes to.  LSMs
had to go to a lot of pain to avoid doing that for reading policy
configuration at boot.

Of course it's now several years later, and moods and tastes change in
the kernel community, but I suspect it's still frowned upon.

> point, here are my thoughts:
> 
> 
> 1. What you suggest is to expose the internal data to user space and
> pull it. Isn't that what cryo tried to do ?  And the conclusion was
> that it takes too many interfaces to work out, code in, provide, and
> maintain forever, with issues related to backward compatibility and
> what not. In fact, the conclusion was "let's do a kernel-blob" !

Right, the problem with cryo was that it tried to do the checkpoint and
restart itself, at too fine-grained a level in terms of the kernel-user
API.

What Dave is suggesting (as I understand it) is just changing the way
the data is shipped between kernel and user-space, while continuing to
use sys_checkpoint() and sys_restart().  So I think it's a less
fundamental change than you are thinking.

Now maybe eventually he's going to propose something more esoteric where
doing the mount() actually starts the checkpoint (that's where I figured
he'd be heading), but I think it would still be one action on the part
of userspace telling the kernel "do a checkpoint".

(Or am I wrong on that, Dave?)

[...]

(I'll let Dave respond to your other questions i.e. about what you gain)

> If this is only to be able to parallelize checkpoint - then let's discuss
> the problem, not a specific solution.

The specific problem is that you have userspace pass an fd to the
kernel, and the kernel reading from and writing to it, which is
un-linux-y.

> > It enables us to do all the I/O from userspace: no in-kernel
> > sys_read/write().
> 
> What's so wrong with in-kernel vfs_read/write() ?  You mentioned deadlocks,

It's un-linux-y :)

[...]

> 5. Your suggestion leaves too many details out. Yes, it's a call for
> discussion. But still. Zap, OpenVZ and other systems build on experience
> and working code. We know how to do incremental, live, and other goodies.
> I'm not sure how these would work with your scheme.

Not sure what problems you envision, but take the specific example of
a pre-dump to prepare for a quick live migration.  I could envision a
pre_checkpoint() system call creating the checkpoint data directory,
starting to dump out the data, and (optimistically) starting to copy
that data over the network.  After that, the do_checkpoint() syscall
checks file timestamps and quickly dumps and network-copies only the
data which changed up until the container was frozen.
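
To make the timestamp idea a bit more concrete, here is a rough
userspace sketch in Python.  Everything in it is hypothetical:
sync_changed(), the flat directory layout, and the use of a local
directory copy as a stand-in for the network copy are assumptions made
for illustration, not part of any proposed interface.  The same helper
would run once after the pre-dump and once more after the freeze.

```python
import os
import shutil


def sync_changed(src_dir, dst_dir, seen):
    """Copy regular files whose mtime differs from the one recorded in `seen`.

    `seen` maps file name -> last copied mtime (ns) and is updated in place.
    Returns the sorted list of file names copied on this pass.
    Hypothetical helper: dst_dir stands in for the remote end of a migration.
    """
    os.makedirs(dst_dir, exist_ok=True)
    copied = []
    for name in sorted(os.listdir(src_dir)):
        src = os.path.join(src_dir, name)
        if not os.path.isfile(src):
            continue  # this sketch only handles a flat directory
        mtime = os.stat(src).st_mtime_ns
        if seen.get(name) != mtime:
            # New or modified since the previous pass: ship it again.
            shutil.copy2(src, os.path.join(dst_dir, name))
            seen[name] = mtime
            copied.append(name)
    return copied
```

The first call copies everything (the optimistic pre-dump copy); a
second call with the same `seen` dict after the freeze copies only the
files whose timestamps moved in between, which is what keeps the final
step short.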

-serge