How much of a mess does OpenVZ make? ;) Was: What can OpenVZ do?

Oren Laadan orenl at
Wed Mar 18 12:04:01 PDT 2009

Mike Waychison wrote:
> Oren Laadan wrote:
>> Mike Waychison wrote:
>>> Linus Torvalds wrote:
>>>> On Thu, 12 Mar 2009, Sukadev Bhattiprolu wrote:
>>>>> Ying Han [yinghan at] wrote:
>>>>> | Hi Serge:
>>>>> | I made a patch based on Oren's tree recently which implement a new
>>>>> | syscall clone_with_pid. I tested with checkpoint/restart process
>>>>> tree
>>>>> | and it works as expected.
>>>>> Yes, I think we had a version of clone() with pid a while ago.
>>>> Are people _at_all_ thinking about security?
>>>> Obviously not.
>>>> There's no way we can do anything like this. Sure, it's trivial to
>>>> do inside the kernel. But it also sounds like a _wonderful_ attack
>>>> vector against badly written user-land software that sends signals
>>>> and has small races.
>>> I'm not really sure how this is different than a malicious app going
>>> off and spawning thousands of threads in an attempt to hit a target
>>> pid from a security pov.  Sure, it makes it easier, but it's not like
>>> there is anything in place to close the attack vector.
>>>> Quite frankly, from having followed the discussion(s) over the last
>>>> few weeks about checkpoint/restart in various forms, my reaction to
>>>> just about _all_ of this is that people pushing this are pretty damn
>>>> borderline.
>>>> I think you guys are working on all the wrong problems.
>>>> Let's face it, we're not going to _ever_ checkpoint any kind of
>>>> general case process. Just TCP makes that fundamentally impossible
>>>> in the general case, and there are lots and lots of other cases too
>>>> (just something as totally _trivial_ as all the files in the
>>>> filesystem that don't get rolled back).
>>> In some instances such as ours, TCP is probably the easiest thing to
>>> migrate.  In an rpc-based cluster application, TCP is nothing more
>>> than an RPC channel and applications already have to handle RPC
>>> channel failure and re-establishment.
>>> I agree that this is not the 'general case' as you mention above
>>> however.  This is the bit that sorta bothers me with the way the
>>> implementation has been going so far on this list.  The
>>> implementation that folks are building on top of Oren's patchset
>>> tries to be everything to everybody.  For our purposes, we need to
>>> have the flexibility of choosing *how* we checkpoint.  The line seems
>>> to be arbitrarily drawn at the kernel being responsible for
>>> checkpointing and restoring all resources associated with a task, and
>>> leaving userland with nothing more than transporting filesystem
>>> bits.  This approach isn't flexible enough:   Consider the case where
>>> we want to stub out most of the TCP file descriptors with
>>> ECONNRESETed sockets because we know that they are RPC sockets and
>>> can re-establish themselves, but we want to use some other mechanism
>>> for TCP sockets we don't know much about.  The current monolithic
>>> approach has zero flexibility for doing anything like this, and I
>>> figure out how we could even fit anything like this in.
>> The flexibility exists, but wasn't spelled out, so here it is:
>> 1) Similar to madvice(), I envision a cradvice() that could tell the c/r
>> something about specific resources, e.g.:
>>  * cradvice(CR_ADV_MEM, ptr, len)  -> don't save that memory, it's
>> scratch
>>  * cradvice(CR_ADV_SOCK, fd, CR_ADV_SOCK_RESET)  -> reset connection
>> on restart
>> etc .. (nevermind the exact interface right now)
>> 2) Tasks can ask to be notified (e.g. register a signal) when a
>> checkpoint
>> or a restart complete successfully. At that time they can do their
>> private
>> house-keeping if they know better.
>> 3) If restoring some resource is significantly easier in user space
>> (e.g. a
>> file-descriptor of some special device which user space knows how to
>> re-initialize), then the restarting task can prepare it ahead of time,
>> and, call:
>>   * cradvice(CR_ADV_USERFD, fd, 0)  -> use the fd in place instead of
>> trying
>>                        to restore it yourself.
> This would be called by the embryo process (mktree.c?) before calling
> sys_restart?


>> Method #3 is what I used in Zap to implement distributed checkpoints,
>> where
>> it is so much easier to recreate all network connections in user space
>> then
>> putting that logic into the kernel.
>> Now, on the other hand, doing the c/r from userland is much less flexible
>> than in the kernel (e.g. epollfd, futex state and much more) and requires
>> exposing tremendous amount of in-kernel data to user space. And we all
>> know
>> than exposing internals is always a one-way ticket :(
>> [...]
>> Oren.

More information about the Containers mailing list