Re: How much of a mess does OpenVZ make? ;) Was: What can OpenVZdo?

From: Oren Laadan
Date: Wed Mar 18 2009 - 15:07:56 EST

Next message: Karel Zak: "[ANNOUNCE] util-linux-ng v2.15-rc1"
Previous message: Eric Dumazet: "Re: High contention on the sk_buff_head.lock"
In reply to: Mike Waychison: "Re: How much of a mess does OpenVZ make? ;) Was: What can OpenVZdo?"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

Mike Waychison wrote:
> Oren Laadan wrote:
>>
>> Mike Waychison wrote:
>>> Linus Torvalds wrote:
>>>> On Thu, 12 Mar 2009, Sukadev Bhattiprolu wrote:
>>>>
>>>>> Ying Han [yinghan@xxxxxxxxxx] wrote:
>>>>> | Hi Serge:
>>>>> | I made a patch based on Oren's tree recently which implement a new
>>>>> | syscall clone_with_pid. I tested with checkpoint/restart process
>>>>> tree
>>>>> | and it works as expected.
>>>>>
>>>>> Yes, I think we had a version of clone() with pid a while ago.
>>>> Are people _at_all_ thinking about security?
>>>>
>>>> Obviously not.
>>>>
>>>> There's no way we can do anything like this. Sure, it's trivial to
>>>> do inside the kernel. But it also sounds like a _wonderful_ attack
>>>> vector against badly written user-land software that sends signals
>>>> and has small races.
>>> I'm not really sure how this is different than a malicious app going
>>> off and spawning thousands of threads in an attempt to hit a target
>>> pid from a security pov. Sure, it makes it easier, but it's not like
>>> there is anything in place to close the attack vector.
>>>
>>>> Quite frankly, from having followed the discussion(s) over the last
>>>> few weeks about checkpoint/restart in various forms, my reaction to
>>>> just about _all_ of this is that people pushing this are pretty damn
>>>> borderline.
>>>> I think you guys are working on all the wrong problems.
>>>> Let's face it, we're not going to _ever_ checkpoint any kind of
>>>> general case process. Just TCP makes that fundamentally impossible
>>>> in the general case, and there are lots and lots of other cases too
>>>> (just something as totally _trivial_ as all the files in the
>>>> filesystem that don't get rolled back).
>>> In some instances such as ours, TCP is probably the easiest thing to
>>> migrate. In an rpc-based cluster application, TCP is nothing more
>>> than an RPC channel and applications already have to handle RPC
>>> channel failure and re-establishment.
>>>
>>> I agree that this is not the 'general case' as you mention above
>>> however. This is the bit that sorta bothers me with the way the
>>> implementation has been going so far on this list. The
>>> implementation that folks are building on top of Oren's patchset
>>> tries to be everything to everybody. For our purposes, we need to
>>> have the flexibility of choosing *how* we checkpoint. The line seems
>>> to be arbitrarily drawn at the kernel being responsible for
>>> checkpointing and restoring all resources associated with a task, and
>>> leaving userland with nothing more than transporting filesystem
>>> bits. This approach isn't flexible enough: Consider the case where
>>> we want to stub out most of the TCP file descriptors with
>>> ECONNRESETed sockets because we know that they are RPC sockets and
>>> can re-establish themselves, but we want to use some other mechanism
>>> for TCP sockets we don't know much about. The current monolithic
>>> approach has zero flexibility for doing anything like this, and I
>>> figure out how we could even fit anything like this in.
>>
>> The flexibility exists, but wasn't spelled out, so here it is:
>>
>> 1) Similar to madvice(), I envision a cradvice() that could tell the c/r
>> something about specific resources, e.g.:
>> * cradvice(CR_ADV_MEM, ptr, len) -> don't save that memory, it's
>> scratch
>> * cradvice(CR_ADV_SOCK, fd, CR_ADV_SOCK_RESET) -> reset connection
>> on restart
>> etc .. (nevermind the exact interface right now)
>>
>> 2) Tasks can ask to be notified (e.g. register a signal) when a
>> checkpoint
>> or a restart complete successfully. At that time they can do their
>> private
>> house-keeping if they know better.
>>
>> 3) If restoring some resource is significantly easier in user space
>> (e.g. a
>> file-descriptor of some special device which user space knows how to
>> re-initialize), then the restarting task can prepare it ahead of time,
>> and, call:
>> * cradvice(CR_ADV_USERFD, fd, 0) -> use the fd in place instead of
>> trying
>> to restore it yourself.
>
> This would be called by the embryo process (mktree.c?) before calling
> sys_restart?

Yes.

>
>>
>> Method #3 is what I used in Zap to implement distributed checkpoints,
>> where
>> it is so much easier to recreate all network connections in user space
>> then
>> putting that logic into the kernel.
>>
>> Now, on the other hand, doing the c/r from userland is much less flexible
>> than in the kernel (e.g. epollfd, futex state and much more) and requires
>> exposing tremendous amount of in-kernel data to user space. And we all
>> know
>> than exposing internals is always a one-way ticket :(
>>
>> [...]
>>
>> Oren.
>>
>>
>
>
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

Next message: Karel Zak: "[ANNOUNCE] util-linux-ng v2.15-rc1"
Previous message: Eric Dumazet: "Re: High contention on the sk_buff_head.lock"
In reply to: Mike Waychison: "Re: How much of a mess does OpenVZ make? ;) Was: What can OpenVZdo?"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]