c/r failure

Matt Helsley matthltc at us.ibm.com
Wed Mar 23 17:52:30 PDT 2011


On Mon, Mar 07, 2011 at 01:35:03PM -0600, Serge E. Hallyn wrote:
> Hey,
> 
> I'm using ckpt-v23-rc1-pids branches of both linux-cr and user-cr (with
> a few trivial fixes, packaged at
> https://launchpad.net/~appcr/+archive/ppa).  Checkpoint is being done
> using
> 	/usr/bin/appcheckpoint -N -l $vardir/log $pid > $vardir/ckpt
> and restart with
> 	/usr/bin/apprestart -vd --mntns --mount-pty --pids -l $vardir/rlog < $vardir/ckpt
> 
> The resulting checkpoint file is attached as 'ckpt', the result
> of 'ckptinfo -vp $ckpt' is in ckpt-ckptinfo, and the kernel log
> (partial) from the checkpoint operation is in kernel.log.  When I
> restart, it fails with console output shown in restart.console.out.

Hi Serge,

	Unfortunately ckptinfo is broken. My most recent series of 18 patches
to user-cr fixes it. I've attached the output, generated with an earlier
version of the kernel headers than the one in the -pids branch. Consequently
there are a couple of "UNKOWN" hdrs, but otherwise it should give you much
more useful information, since the broken version fails to iterate over
the entire checkpoint file.

<snip>

> <1012>====== TASKS
> <1012>  [  0] pid     0(  1) (tgid   1) ppid     0(  0)  (pgid -4097) sid    -1(-4097) creator     0(  0)
> <1012>  [  1] pid     0(  2) (tgid   2) ppid     0(  1)  (pgid   2) sid     0(  2) creator     0(  0) prev 301
> <1012>  [  2] pid     0(  3) (tgid   3) ppid     0(  1)  (pgid -4097) sid    -1(-4097) creator   301(  0)   S
> <1012>  [  3] pid     0(  4) (tgid   4) ppid     0(  1)  (pgid   4) sid     0(  4) creator     0(  0) next 301 prev   0      
> <1012>  [  4] pid     0(  5) (tgid   5) ppid     0(  4)  (pgid   5) sid     0(  5) creator     0(  0) next   0 prev   0      
> <1012>  [  5] pid     0(  6) (tgid   6) ppid     0(  5)  (pgid   5) sid     0(  5) creator     0(  0) next   0 placeholder 301
> <1012>  [  6] pid   301(  7) (tgid  -1) ppid     0(  1)  (pgid  -1) sid    -1(-4097) creator     0(  0) next   0 prev   0      D
> <1012>............
> <1012>new pidns without init
> <1012>forking coordinator in new pidns
> <1>fork child vpid 0 flags 0x1
> <1>task 0 forking with flags 11 numpids 1
> <1>task 0 pids: 0
> <1>...
> <1>forked child vpid 2 (asked 0)
> failed to create specific pid with eclone
> <1013>====== PIDS ARRAY
> <1013>[0] pid 0 depth 0
> <1013>[1] pid 0 depth 0
> <1013>[2] pid 0 depth 0
> <1013>[3] pid 0 depth 0
> <1013>[4] pid 0 depth 0
> <1013>[5] pid 0 depth 0
> <1013>............
> <1012>Coordinator failed to report status
> <1012>SIGCHLD: child not ready
> <1012>SIGCHLD: already collected
> <1012>c/r failed ?
> restart: Input/output error
> Restart failed
> root at cr-natty-i386:/home/serge#

I've been having trouble with earlier versions of user-cr than the -pids
branch. The symptoms were that sys_restart() failed to read a single
byte of the checkpoint image, despite what the stdio output of the restart
userspace program above suggests. To see if you're experiencing the same
problem, look at dmesg and check whether it dies at "pos 0" in the checkpoint
file with "Expecting type 1". That indicates it hasn't even read the first hdr!

It seems like the "fork feeders" aren't outputting anything and/or the pipes
are hooked up wrong in user-cr's restart, because I get EPIPE right at the
beginning. This was masked by some trouble with ckpt_err() that I still
haven't been able to figure out: the "error" reported in dmesg is -512
(ERESTARTSYS) but really should be -32 (EPIPE). You rewrote that code, didn't you? ;)

Anyhow, today I managed to bisect my user-cr troubles down to:

5b97422c4c1342a128df508cda7c4639ecb24a36

The revert was not clean, and resolving the conflicts on top of it did
not seem to fix the problem, so... :/ All I can recommend at the moment is
sticking to the earlier branches.

Hope that helps. Again, sorry for the delay.

Cheers,
	-Matt Helsley
-------------- next part --------------
A non-text attachment was scrubbed...
Name: serge-ckpt.img.info
Type: application/x-info
Size: 61995 bytes
Desc: ckptinfo output
Url : http://lists.linux-foundation.org/pipermail/containers/attachments/20110323/f16928c8/attachment-0001.bin 


More information about the Containers mailing list