[RFC][v3][PATCH 7/7] Define clone_with_pids syscall

Oren Laadan orenl at cs.columbia.edu
Mon Jun 1 09:58:15 PDT 2009



Oren Laadan wrote:
> 
> Serge E. Hallyn wrote:
>> Quoting Sukadev Bhattiprolu (sukadev at linux.vnet.ibm.com):
>>
>> Bear with me as I try to get this straight...
>>
>> Let's say we have the following task being checkpointed (btw, the
>> actual checkpoint code hopefully will, as Oren mentioned, not
>> be always checkpointing all of a task's pids, but only the
>> pids from the checkpointer's pidns level an up - but that doesn't
>> matter to you):
>>
>> Either:
>>
>> level no    |    pid
>> 0               1001
>> 1                50
>> 2                 3
>>
>> or
>>
>> level no    |    pid
>> 0               1001
>> 1                50
>> 2                 1
>>
>> Now we want to restart that inside a container.  So for the
>> first task we have a parent task:
>>
>> level no       |     pid
>> 0                    5009
>> 1                    1000
>> 2                     49
>> 3                      2
>>
>> calling clone_with_pid with upids {1001,50,3} to produce task:
>>
>> level no       |     pid
>> 0                   5010
>> 1                   1001
>> 2                    50
>> 3                     3
>>
>> So the numbers in your code for this case become:
>>
>>> +	unum_pids = pid_set.num_pids;
>>         unum_pids = 3
>>> +	knum_pids = task_pid(current)->level + 1;
>>         knum_pids = 3 + 1 = 4
>>
>>> +	target_pids = kzalloc((knum_pids + 1) * sizeof(pid_t), GFP_KERNEL);
>> XXX First problem:
>>         target_pids gets room for 5 pids, one more than we need.
> 
> The two issues are related, and this is intentional. The idea
> was that when you use CLONE_NEWPID, you imply a new nesting level
> and a pid==1 there. So you are not supposed to specify the pid
> of that new level.
> 
> IOW, the parent specify the pid's of the child from the _parent's_
> level and up (of the desired depth). CLONE_NEWPID creates a new
> pidns level below the parent's, and that is not covered in the
> array of pids.
> 
> By allocation an extra slot and forcing it to be 0, we ensure that
> the case of CLONE_NEWPID is covered correctly. Clearly, if this
> flag isn't set, then the extra slot is redundant (but doesn't hurt).
> 
>>> +	j = knum_pids - unum_pids;
>>         j = 1, so we copy the 3 pids in the right place.
>>
>>> +	rc = copy_from_user(&target_pids[j], pid_set.target_pids, size);
>>> +	if (rc) {
>>> +		rc = -EFAULT;
>>> +		goto out_free;
>>> +	}
>>> +
>>> +	return target_pids;
>>
>> For the second one, we have a parent task
>>
>> level no       |     pid
>> 0                   5009
>> 1                   1000
>> 2                    49
>>
>> calling clone_with_pid with CLONE_NEWPID and {1001,50,1} to produce:
>>
>> level no       |     pid
>> 0                   5010
>> 1                   1001
>> 2                    50
>> 3                    1
>>
>> So the numbers in your code become:
>>
>>> +	unum_pids = pid_set.num_pids;
>>         unum_pids = 3
> 
> This is a "bug" of the parent. The parent should specify the pids
> from the parent's level only and up, and not include the new level
> below that will be created. (After all, it will have to be 1).

Suka: maybe state this explicitly,too, in the patch description
and in the comments in the code ?   Like:

"The syscall operates relative to the parent's nesting level. If
CLONE_NEWPID is used as well, it will create an additional level
below. In this case the syscall indicates the desired pids at the
nesting levels seen by the parent."

Oren.



More information about the Containers mailing list