[RFC][PATCH] ns: Syscalls for better namespace sharing control.
daniel.lezcano at free.fr
Sun Feb 28 14:05:53 PST 2010
Eric W. Biederman wrote:
> Pavel Emelyanov <xemul at parallels.com> writes:
>> Eric W. Biederman wrote:
>>> Pavel Emelyanov <xemul at parallels.com> writes:
>>>> Eric W. Biederman wrote:
>>>>> Pavel Emelyanov <xemul at parallels.com> writes:
>>>>>> Thanks. What's the problem with setns?
>>>>> joining a preexisting namespace is roughly the same problem as
>>>>> unsharing a namespace. We simply haven't figured out how to do it
>>>>> safely for the pid and the uid namespaces.
>>>> The pid may change after this for sure. What problems do you know
>>>> about it? What if we try to allocate the same PID in a new space
>>>> or return -EBUSY? This will be a good starting point. If we manage
>>>> to fix it later this will not break the API at all.
>>> Parentage. The pid is the identity of a process and all kinds of things
>>> make assumptions in all kinds of strange places. I don't see how
>>> waitpid can work if you change the pid.
>> Agreed. But what if we enter a pid space which is a subnamespace of the current
>> one? In that case the parent will still see the task by its old pid. We can restrict
>> the first version of entering with this rule as well, and this restriction will not
>> block us in the typical use case (I mean entering a container from the host).
> When I was last thinking about pid namespaces and unshare, the idea I came
> to was that an unshare of the pid namespace should only affect which pid namespace
> your children are in.
> I remember that to do that there were a few cases where you would have to access
> task->pid->pid_ns instead of task->nsproxy->pid_ns, but essentially it was pretty
>>> glibc doesn't cope if you change someone's pid.
>> OK, but what if we try to allocate the same pid returning -EBUSY on failure?
>> My aim is to provide even a restricted enter. For most of the cases this
>> should work and make our lives easier. So two restrictions currently:
>> a) enter a sub namespace
>> b) allocate the same pid as we have now
>> Hm? :)
> Replacing struct pid is guaranteed to do all kinds of nasty things with
> signal handling and the like, de_thread is nasty enough and you are talking
> something worse. So if we can change pid namespaces without changing
> the pid I am for it.
I agree with all the points you and Pavel talked about, but I don't
feel comfortable having the current process switch its pid namespace,
because of the process tree hierarchy (what will be the parent of the
process when you enter the pid namespace, for example?). What is the
difference from sys_bindns or sys_hijack, proposed a couple of
years ago?
I made a suggestion some weeks ago about a new syscall 'cloneat', where
the child process becomes the child of the target process specified in
the syscall. Maybe it would be interesting to replace 'setns' with, or
add alongside it, a 'cloneat' syscall taking the file descriptor as a
parameter. The copy_process function would then use not the nsproxy of
the caller but the one provided in the fd argument.
The newly created process becomes the child of the process from which we
retrieved the namespace with nsfd, and that process has to 'waitpid' it (the
caller of 'cloneat' cannot wait for it). It's a bit similar to the
CLONE_PARENT flag, except the creation order is inverted (the father
creates for the child).
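
To make the CLONE_PARENT comparison concrete, here is a minimal sketch of what
that existing flag does today. It does not implement 'cloneat' (no such syscall
exists); it only shows that a process created with CLONE_PARENT ends up as a
child of the caller's parent, so the caller itself cannot wait for it. The names
clone_parent_demo and report_parent are made up for this illustration.

```c
#define _GNU_SOURCE
#include <assert.h>   /* only needed for the usage check shown after the block */
#include <sched.h>
#include <signal.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define STACK_SIZE (256 * 1024)

/* Runs in the process created with CLONE_PARENT: report who our parent is. */
static int report_parent(void *arg)
{
    int fd = *(int *)arg;
    pid_t ppid = getppid();

    if (write(fd, &ppid, sizeof(ppid)) != (ssize_t)sizeof(ppid))
        return 1;
    return 0;
}

/*
 * Hypothetical demo helper (not part of any proposed API).
 * Returns 1 if a process created by our child with CLONE_PARENT
 * reports the caller as its parent, 0 if not, -1 on error.
 */
int clone_parent_demo(void)
{
    int pipefd[2];
    pid_t helper, seen = -1;
    ssize_t n;

    if (pipe(pipefd) < 0)
        return -1;

    helper = fork();
    if (helper < 0)
        return -1;

    if (helper == 0) {
        /* Helper: create a sibling of itself rather than a child. */
        char *stack = malloc(STACK_SIZE);

        if (!stack)
            _exit(1);
        close(pipefd[0]);
        /* Pass the top of the buffer: the stack grows down on most arches. */
        if (clone(report_parent, stack + STACK_SIZE,
                  CLONE_PARENT | SIGCHLD, &pipefd[1]) < 0)
            _exit(1);
        /* The helper is not the parent of the new process; it just exits. */
        _exit(0);
    }

    close(pipefd[1]);
    n = read(pipefd[0], &seen, sizeof(seen));
    close(pipefd[0]);

    /* Both the helper and the CLONE_PARENT process are our children. */
    while (wait(NULL) > 0)
        ;

    if (n != (ssize_t)sizeof(seen))
        return -1;
    return seen == getpid() ? 1 : 0;
}
```

A caller would check it with something like assert(clone_parent_demo() == 1).
With 'cloneat' as described above, the relationship would be reversed: the
caller would pick the new parent (via the fd) instead of inheriting its own.
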
So when entering the container, we specify pid 1 of the container,
which is usually a child reaper.
Does it make sense?