Linux Checkpoint-Restart - v19

Oren Laadan orenl at cs.columbia.edu
Fri Mar 19 08:34:09 PDT 2010



Jiro SEKIBA wrote:
> Hi,
> On 2010/03/18, at 5:55, Serge E. Hallyn wrote:
> 
>> Quoting Jiro SEKIBA (jir at dependable-os.net):
>>> Hi,
>>>
>>> Thank you for prompt reply!
>>> Sorry that I didn't post to containers at lists.linux-foundation.org.
>>>
>>> On 2010/03/16, at 7:55, Oren Laadan wrote:
>>>
>>>> Hi,
>>>>
>>>> Thanks for taking the time to evaluate c/r. You may want to also
>>>> try the latest, which is (as of now) ckpt-v20-rc2.
>>> Yeah, I'll eventually try to keep up with the latest,
>>> but I just want to try the one you think is stable first anyway.
>>>
>>>> In the future, please CC the containers mailing list for issues
>>>> related to c/r, at "containers at lists.linux-foundation.org".
>>>>
>>>> Jiro SEKIBA wrote:
>>>>> Hi,
>>>>> I'm trying to evaluate external checkpoint/restart with the cr-v19 kernel.
>>>>> However, when I restart, I get a "Killed" message on stdout.
>>>>> Do you have any tips or clues that are not in
>>>>> Documentation/checkpoint/usage.txt ?
>>>>> I'm using the kernel pulled from
>>>>> git://git.ncl.cs.columbia.edu/pub/git/linux-cr.git ,
>>>>> with the tag named "ckpt-v19" checked out. The base distro is Ubuntu 9.10.
>>>>> I ran the self checkpoint/restart sample program in Documentation/checkpoint.
>>>>> It works as written in usage.txt.
>>>>> However, I cannot make external checkpoint/restart work properly.
>>>>> I made a simple test program below and created a checkpoint externally using
>>>>> the program in Documentation/checkpoint/; it looks like the checkpoint file is
>>>>> created properly.
>>>>> However, when I ran self_restart < ckpt.image, I got a "Killed" message.
>>>> If you take an external checkpoint, then you need to match it
>>>> with an external restart, as opposed to self_restart.
>>>>
>>>> Otherwise, restarting with self_restart from a checkpoint that is
>>>> not a self-checkpoint can yield unexpected results.
>>>>
>>>> Since you don't mention in your post, I don't know if you are using
>>>> the tools from user-cr. If not, then you should use 'checkpoint' and
>>>> 'restart' tools from there. It is available from:
>>>> 	git://git.ncl.cs.columbia.edu/pub/git/user-cr.git
>>>> (use the same branch as the one you used for linux-cr).
>>>>
>>>> Once you have the tools compiled, and you checkpoint with the
>>>> 'checkpoint' utility from there, you can restart with:
>>>> 	restart -v < ckpt.image
>>>>
>>> Thank you for the information.
>>> Actually, I was trying to create the checkpoint with the program in Documentation/checkpoint.
>>>
>>> Now I tried with user-cr, with the binaries compiled from the same tag (ckpt-v19).
>>> Creating the checkpoint looks OK and restart -v reports Success. Nice!
>>> However, the contents of /tmp/test.out never advance;
>>> they remain the same as when the checkpoint was created.
>>>
>>> I tried "./restart -F /cgroup/0 -v --no-pidns < ckpt.image" and got Success.
>>> cat /cgroup/0/tasks shows that there is a process.
>>> ps shows ./test. So it looks like it is restarting.
>>>
>>> # ps axuww |grep $(cat /cgroup/0/tasks )
>>> root      7231  0.1  0.0   1588    64 pts/0    D    16:57   0:00 ./test
>>> root      7238  0.0  0.1   2716   660 pts/1    R+   16:57   0:00 grep 7231
>>>
>>> Under /proc, one file descriptor is open, and it is /tmp/test.out
>>>
>>> #  ls -l /proc/$(cat /cgroup/0/tasks)/fd
>>> total 0
>>> lrwx------ 1 root root 64 Mar 16 16:58 0 -> /tmp/test.out
>>>
>>> Nhh, it's close..
>>>
>>> I found that when I mount the cgroup with -o freezer, self_checkpoint doesn't work.
>>> It worked when I didn't mount the cgroup.
>>> Is that what you expect?
>> No, it is not.  Can you tell us more about exactly how it fails?
>>
> 
> OK, I've compared the dmesg output when self_restart works and when it doesn't.
> When it works, the filename is /tmp/cr-self.out
> 
> [  401.522556] [2307:2307:c/r:ckpt_read_fname:571] read filename '/tmp/cr-self.out'
> [  401.522558] [2307:2307:c/r:restore_open_fname:594] fname '/tmp/cr-self.out' flags 0x2

This means that restart wants to re-open the file /tmp/cr-self.out.
> 
> However, when the contents of the file remain unchanged, the filename is /tmp/cr-self.out.orig,
> which is, of course, the original file bound to the original process.
> 
> [ 1088.414250] [2951:2951:c/r:ckpt_read_fname:571] read filename '/tmp/cr-self.out.orig'
> [ 1088.414253] [2951:2951:c/r:restore_open_fname:594] fname '/tmp/cr-self.out.orig' flags 0x2

This means that restart wants to re-open the file /tmp/cr-self.out.orig.

Could it be that these two restart attempts use two distinct image files
as input ?

The first one seems to correspond to something like:
1) start the test, 2) checkpoint, 3) mv file and cp file, 4) restart

The second one seems to correspond to something like:
1) start the test, 2) mv file and cp file, 3) checkpoint, 4) restart
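
The difference between the two orderings can be illustrated without any c/r
tools. The sketch below is only an illustration under an assumption: that the
checkpoint records the path an open descriptor resolves to at checkpoint time,
which is the same path /proc/PID/fd shows. The scratch directory and the
variable names are hypothetical, chosen for the example.

```shell
#!/bin/sh
# Hypothetical scratch paths, used only for this illustration.
dir=$(mktemp -d)
f="$dir/cr-self.out"

# Hold the file open on descriptor 3, as a running test program would.
exec 3>"$f"

# Path the descriptor resolves to if a checkpoint were taken now.
before=$(readlink "/proc/$$/fd/3")

# "mv file and cp file": rename the original, then copy it back.
mv "$f" "$f.orig"
cp "$f.orig" "$f"

# Path the same descriptor resolves to if a checkpoint were taken now.
after=$(readlink "/proc/$$/fd/3")

echo "before mv/cp: $before"
echo "after mv/cp:  $after"

# Clean up the descriptor and the scratch directory.
exec 3>&-
rm -rf "$dir"
```

The open descriptor follows the renamed file, so a checkpoint taken after the
mv would be expected to name the .orig path, while one taken before it would
name the original path.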

What is the actual error reported when it doesn't work ?  (from restart
and from the kernel log)

> 
> I cannot reproduce it yet, but at least the cgroup freezer option doesn't have the effect I mentioned.
> Sorry if that confused you.
> 
> I still cannot restart from an external checkpoint.
> I'll try v20 next time.

If it doesn't work, can you please describe again the exact order of
commands that you use and the reported error(s) ?

Oren.

> 
>> Maybe get the cr_tests (either from Oren's tree or from
>> git clone git://git.sr71.net/~hallyn/cr_tests.git), cd cr_tests,
>> make, cd simple, run ./ckpt and send us the contents of
>> /tmp/log, dmesg, and ckptinfo -ve /tmp/out ?
> 
> I think it runs OK, but I'm sending it just in case.
> /tmp/log was empty by the way.
> 
> thanks
> 
>>> Thank you again for the help!
>>> I'm starting to feel it would be better to use the latest ..
>> -serge

