Re: [Ksummit-2010-discuss] checkpoint-restart: naked patch

From: Oren Laadan
Date: Fri Nov 05 2010 - 19:18:41 EST




On 11/05/2010 05:28 AM, Tejun Heo wrote:
Hello,

On 11/04/2010 05:44 PM, Gene Cooperman wrote:
In our personal view, a key difference between in-kernel and userland
approaches is the issue of security.

That's an interesting point but I don't think it's a dealbreaker.
... but it's not like CR is gonna be deployed on
majority of desktops and servers (if so, let's talk about it then).

This is a good point to clarify some issues. C/R has several good
targets. For example, BLCR has targeted HPC batch facilities, and
does it well.

DMTCP started life on the desktop, and it's still a primary focus of
DMTCP. We worked to support screen on this release precisely so
that advanced desktop users have the option of putting their whole
screen session under checkpoint control. It complements the core
goal of screen: If you walk away from a terminal, you can get back
the session elsewhere. If your session crashes, you can get back
the session elsewhere (depending on where you save the checkpoint
files, of course :-) ).

Call me skeptical but I still don't see, yet, it being a mainstream
thing (for average sysadmin John and proverbial aunt Tilly). It
definitely is useful for many different use cases tho. Hey, but let's
see.

These are also some excellent points for discussion! The manager thread
is visible. For example, if you run a gdb session under checkpoint
control (only available in our unstable branch, currently), then
the gdb session will indeed see the checkpoint manager thread.

I don't think gdb seeing it is a big deal as long as it's hidden from
the application itself.

We try to hide the reserved signal (SIGUSR2 by default, but the user
can configure it to anything else). We put wrappers around system
calls that might see our signal handler, but I'm sure there are
cases where we might not succeed --- and so a skilled user would
have to configure DMTCP to use a different signal. And of course,
there is the rare application that repeatedly resets _every_ signal.
We encountered this in an earlier version of Maple, and the Maple
developers worked with us to open up a hole so that we could
checkpoint Maple in future versions.

[while] all programs should be ready to handle -EINTR failure from system
calls, it's something which is very difficult to verify and test and
could lead to once-in-a-blue-moon head scratchy kind of failures.

Exactly right! Excellent point. Perhaps this gets down to
philosophy, and what is the nature of a bug. :-) In some cases, we
have encountered this issue. Our solution was either to refuse to
checkpoint within certain system calls, or to check the return
value and, if it was -EINTR, re-execute the system call. This
works, again, because we are using wrappers around many (but not
all) of the system calls.

I'm probably missing something but can't you stop the application
using PTRACE_ATTACH? You wouldn't need to hijack a signal or worry
about -EINTR failures (there are some exceptions but nothing really to
worry about). Also, unless the manager thread needs to be always
online, you can inject manager thread by manipulating the target
process states while taking a snapshot.

This is an excellent example to demonstrate several points:

* To freeze the processes, you can use the (quote) "hairy" signal
overload mechanism, or the even hairier ptrace; both, by the way,
have performance problems with many processes/threads. Or you can
use the in-kernel freezer-cgroup and forget about workarounds,
like linux-cr does. And ~200 lines in said diff are dedicated
exactly to that.

* Then, because both the workaround and the entire philosophy
of MTCP c/r engine is that affected processes _participate_ in
the checkpoint, their syscalls _must_ be interrupted. In contrast,
the linux-cr kernel approach not only allows checkpointing processes
without their collaboration, but also builds on the native signal
handling kernel code to restart the system calls (both after
unfreeze, and after restart), such that the original process
does not observe -EINTR.

But since you ask :-), there is one thing on our wish list. We
handle address space randomization, vdso, vsyscall, and so on quite
well. We do not turn off address space randomization (although on
restart, we map user segments back to their original addresses).
Probably the randomized value of brk (end-of-data or end of heap) is
the thing that gave us the most troubles and that's where the code
is the most hairy.

Can you please elaborate a bit? What do you want to see changed?

Aha ... another great example: yet another piece of the suspect
diff in question is dedicated to allow a restarting process to
request a specific location for the vdso.

BTW, a real security expert (and I'm not one...) may argue that
this operation should only be allowed to privileged users. In fact,
if your code gets around the linux ASLR mechanisms, then someone
should fix the kernel ASLR code :)

The implementation is reasonably modularized. In the rush to
address bugs or feature requirements of users, we sometimes cut
corners. We intend to go back and fix those things. Roughly, the
architecture of DMTCP is to do things in two layers: MTCP handles a
single multi-threaded process. There is a separate library mtcp.so.
The higher layer (redundantly again called DMTCP) is implemented in
dmtcphijack.so. In a _very_ rough kind of way, MTCP does a lot of
what would be done within kernel C/R. But the higher DMTCP layer
takes on some of those responsibilities in places. For example,
DMTCP does part of analyzing the pseudo-ttys, since it's not always
easy to ensure that it's the controlling terminal of some process
that can checkpoint things in the MTCP layer.

Beyond that, the wrappers around system calls are essentially
perfectly modular. Some system calls go together to support a
single kernel feature, and those wrappers are kept in a common file.

I see. I just thought that it would be helpful to have the core part
- which does per-process checkpointing and restoring and corresponds
to the features implemented by in-kernel CR - as a separate thing. It
already sounds like that is mostly the case.

FWIW, the restart portion of linux-cr is designed with this in
mind - it is flexible enough to accommodate smart userspace
tools and wrappers that wish to muck with the processes and
their resources post-restart (but before the processes resume
execution). For example, a distributed checkpoint tool could,
at restart time, reestablish the necessary network connections
(which is much different than live migration of connections,
and clearly not a kernel task). This way, it is trivial to migrate
a distributed application from one set of hosts to another, on
different networks, with very little effort.


I don't have much idea about the scope of the whole thing, so please
feel free to hammer senses into me if I go off track. From what I
read, it seems like once the target process is stopped, dmtcp is able
to get most information necessary from kernel via /proc and other
methods but the paper says that it needs to intercept socket related
calls to gather enough information to recreate them later. I'm
curious what's missing from the current /proc. You can map socket to
inode from /proc/*/fd which can be matched to an entry in
/proc/*/net/PROTO to find out the addresses and most socket options
should be readable via getsockopt. Am I missing something?
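The mapping Tejun describes can indeed be done from userspace on Linux. A
sketch (assuming a mounted /proc; the /proc/net/tcp field layout is per
proc(5), where the inode is the tenth whitespace-separated field):

```python
import os
import socket

def socket_inode(fd):
    """Map an fd to its socket inode via /proc/*/fd, as suggested."""
    link = os.readlink("/proc/self/fd/%d" % fd)   # e.g. "socket:[12345]"
    return int(link[len("socket:["):-1])

def tcp_entry(inode):
    """Find the /proc/net/tcp entry for a socket inode.

    Returns (local_ip_hex, local_port) parsed from the kernel's table,
    or None if the inode is not found.
    """
    with open("/proc/net/tcp") as f:
        next(f)                                   # skip the header line
        for line in f:
            fields = line.split()
            if int(fields[9]) == inode:           # field 9 is the inode
                ip_hex, port_hex = fields[1].split(":")
                return ip_hex, int(port_hex, 16)
    return None

# Usage: bind a listening socket, then look it up by inode.
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.bind(("127.0.0.1", 0))
s.listen(1)
_, port = s.getsockname()
entry = tcp_entry(socket_inode(s.fileno()))
```

This recovers addresses and state at checkpoint time; the reply below is about
the other half of the problem, reinstating that state at restart time.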

So you'll need mechanisms not only to read the data at checkpoint
time but also to reinstate the data at restart time. By the time
you are done, the kernel will carry all the c/r code (the suspect
diff in question _and_ the rest of the logic) in the form of new
interfaces and ABIs to userspace...; the userspace code will grow
some more hair; and there will be zero maintainability gain. And at
the same time, you won't be able to leverage optimizations only
possible in the kernel.


I think this is why userland CR implementation makes much more sense.
Most of the states visible to a userland process are rather rigidly
defined by standards and, ultimately, the ABI, and the kernel exports
most of that information to userland one way or the other. Given the
right set of needed features, most of which are probably already
implemented, a userland implementation should have access to most
information necessary to checkpoint without resorting to too messy
methods, and then there inevitably need to be some workarounds to
make CR'd processes behave properly w.r.t. other states on the
system, so userland workarounds are inevitable anyway unless it
resorts to

To be precise, there are three types of userland workarounds:

1) userland workarounds to make a restarted application work when
peer processes aren't saved - e.g., in distributed checkpoint you
need a workaround to rebuild the socket to the peer; or in his
example with the 'nscd' daemon from earlier in the thread.

These are needed regardless of the c/r engine of choice. In many
cases they can be avoided if applications are run in containers.
(which can be as simple as running a program using 'nohup')

2) userland workarounds to duplicate virtualization logic already
done by the kernel - like the userspace pid-namespace and the
complex logic and hacks needed to make it work. This is completely
unnecessary when you do kernel c/r.

3) userland workarounds to compensate for the fact that userspace
can't get or set some state during checkpoint or restart. For
example, in the kernel it's trivial to track shared files. How
would you tell, from userspace, whether fd[0] of parent A and child
B is the same file opened once and then inherited, or the same
filename opened twice individually ? For files, it is possible to figure
this out in user space, e.g. by intercepting and tracking all forks
and all file operations (including passing fd's via AF_UNIX sockets).
There are other hairy ways to do it, but not quite so for other
resources.

As another example, consider SIDs and PGIDs. With proper algorithms
you can ensure that your processes get the right SID at fork time.
But in the general case, you can't reproduce PGIDs accurately
without replaying how the processes (including those that have
already died) behaved.

And to track zombies at checkpoint, you'd need to actually collect
them, so you must do it in a hairy wrapper, and keep the secret
until the application calls wait(). But then, there may be some
side effects due to collecting zombies, e.g. the pid may be reused
against the application's expectation.
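The zombie problem can be seen directly in /proc (Linux-only sketch): a
checkpointer can observe that a child is a zombie, but the exit status is only
released by wait(), and reaping frees the pid for reuse, which is exactly the
side effect described above:

```python
import os
import time

# Fork a child that exits immediately; do NOT reap it yet.
pid = os.fork()
if pid == 0:
    os._exit(0)            # child becomes a zombie until the parent waits

time.sleep(0.2)            # give the child time to exit

# Userspace can *see* the zombie state in /proc/<pid>/stat
# (the one-letter state field follows the "(comm)" field)...
with open("/proc/%d/stat" % pid) as f:
    state = f.read().split(")")[-1].split()[0]

# ...but collecting its status means reaping it, after which the
# kernel may reuse the pid against the application's expectation.
_, status = os.waitpid(pid, 0)
```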

Some of these have workarounds, some not. Do you really think that
re-implementing Linux namespaces in userspace is the way to go ?

Then, you can add to the kernel an endless amount of interfaces to
export all of this - both the data, and the functionality to
reinstate that data at restart time. But ... wait -- isn't that
what linux-cr already does ?

preemptive separation using namespaces and containers, which I frankly
think isn't much of value already and more so going forward.

That is one opinion. Then there are people using VPSs in commercial
and private environments, for example.

VMs are a wonderful (re)invention. Regardless of any single
person's opinion about VMs vs containers, both are here to stay,
and both have their use-cases and users. IMHO, it is wrong to ignore the
need for c/r and migration capabilities for containers, whether
they run full desktop environments, multiple applications or single
processes.

Oren.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/