How much of a mess does OpenVZ make? ;) Was: What can OpenVZ do?
orenl at cs.columbia.edu
Wed Mar 18 12:04:01 PDT 2009
Mike Waychison wrote:
> Oren Laadan wrote:
>> Mike Waychison wrote:
>>> Linus Torvalds wrote:
>>>> On Thu, 12 Mar 2009, Sukadev Bhattiprolu wrote:
>>>>> Ying Han [yinghan at google.com] wrote:
>>>>> | Hi Serge:
>>>>> | I made a patch based on Oren's tree recently which implement a new
>>>>> | syscall clone_with_pid. I tested with checkpoint/restart process
>>>>> | and it works as expected.
>>>>> Yes, I think we had a version of clone() with pid a while ago.
>>>> Are people _at_all_ thinking about security?
>>>> Obviously not.
>>>> There's no way we can do anything like this. Sure, it's trivial to
>>>> do inside the kernel. But it also sounds like a _wonderful_ attack
>>>> vector against badly written user-land software that sends signals
>>>> and has small races.
>>> I'm not really sure how this is different than a malicious app going
>>> off and spawning thousands of threads in an attempt to hit a target
>>> pid from a security pov. Sure, it makes it easier, but it's not like
>>> there is anything in place to close the attack vector.
>>>> Quite frankly, from having followed the discussion(s) over the last
>>>> few weeks about checkpoint/restart in various forms, my reaction to
>>>> just about _all_ of this is that people pushing this are pretty damn
>>>> confused. I think you guys are working on all the wrong problems.
>>>> Let's face it, we're not going to _ever_ checkpoint any kind of
>>>> general case process. Just TCP makes that fundamentally impossible
>>>> in the general case, and there are lots and lots of other cases too
>>>> (just something as totally _trivial_ as all the files in the
>>>> filesystem that don't get rolled back).
>>> In some instances such as ours, TCP is probably the easiest thing to
>>> migrate. In an rpc-based cluster application, TCP is nothing more
>>> than an RPC channel and applications already have to handle RPC
>>> channel failure and re-establishment.
>>> I agree that this is not the 'general case' as you mention above
>>> however. This is the bit that sorta bothers me with the way the
>>> implementation has been going so far on this list. The
>>> implementation that folks are building on top of Oren's patchset
>>> tries to be everything to everybody. For our purposes, we need to
>>> have the flexibility of choosing *how* we checkpoint. The line seems
>>> to be arbitrarily drawn at the kernel being responsible for
>>> checkpointing and restoring all resources associated with a task, and
>>> leaving userland with nothing more than transporting filesystem
>>> bits. This approach isn't flexible enough: Consider the case where
>>> we want to stub out most of the TCP file descriptors with
>>> ECONNRESETed sockets because we know that they are RPC sockets and
>>> can re-establish themselves, but we want to use some other mechanism
>>> for TCP sockets we don't know much about. The current monolithic
>>> approach has zero flexibility for doing anything like this, and I
>>> can't figure out how we could even fit anything like this in.
>> The flexibility exists, but wasn't spelled out, so here it is:
>> 1) Similar to madvise(), I envision a cradvise() that could tell the c/r
>> something about specific resources, e.g.:
>> * cradvise(CR_ADV_MEM, ptr, len) -> don't save that memory, it's scratch
>> * cradvise(CR_ADV_SOCK, fd, CR_ADV_SOCK_RESET) -> reset connection
>> on restart
>> etc .. (nevermind the exact interface right now)
>> 2) Tasks can ask to be notified (e.g. register a signal) when a checkpoint
>> or a restart completes successfully. At that time they can do their
>> house-keeping if they know better.
>> 3) If restoring some resource is significantly easier in user space
>> (e.g. a
>> file-descriptor of some special device which user space knows how to
>> re-initialize), then the restarting task can prepare it ahead of time,
>> and call:
>> * cradvise(CR_ADV_USERFD, fd, 0) -> use the fd in place instead of trying
>> to restore it yourself.
> This would be called by the embryo process (mktree.c?) before calling
> sys_restart()?
>> Method #3 is what I used in Zap to implement distributed checkpoints;
>> it is so much easier to recreate all network connections in user space
>> than putting that logic into the kernel.
>> Now, on the other hand, doing the c/r from userland is much less flexible
>> than in the kernel (e.g. epollfd, futex state and much more) and requires
>> exposing a tremendous amount of in-kernel data to user space. And we all
>> know that exposing internals is always a one-way ticket :(