Re: Process Migration on Linux - Impossible?

Rogier Wolff (R.E.Wolff@BitWizard.nl)
Wed, 1 Oct 1997 23:40:18 +0200 (MET DST)


linux kernel account wrote:
>
>
>
> On 1 Oct 1997, Ketil Z Malde wrote:
>
> > > If you are in the least work for the most bang mode, then you don't do
> > > process migration.
> >
> > But this is Linux, we have an infinite number of monkeys :-)
>
> <rtfl> I haven't had such a good laff off of this list for a while.. Kernel
> bloat is a concern.. Process migration and remote execution should end up
> being different config options..
>
> > > As I've repeatedly said, the dynamic changes are well handled by small SMPs.
> >
> > Yes, well, okay. But I want to cluster a couple of old 486es...? Oh
> > well.
>
> That's my thought exactly..
>
> > > As has been repeatedly proven, moving an already started process is a lose
> > > almost 100% of the time.
> >
> > Right. Unless somebody just typed shutdown -h now at the #-prompt.
>
> Sigh, most of the *proof* that I have seen discussed here is regarding
> compiling.. Parallel making is probably the most widely used task that
> actually gets a big increase from multiple CPUs, but it is by far not the
> only one, or the most important one. In a parallel make, migration would be
> stupid: the tasks are very short-lived, so as long as they are placed on
> the least loaded computer it all works.. What about longer-lived
> processes, ones that can't easily just save their state and continue
> later?
>
> If some fools are playing Quake on two nodes of your 4-node
> cluster and you start 4 rc5 daemons, then they are stuck on the other two
> boxes.. RC5 isn't a good example because it can easily be restarted.. How
> about a 16-node cluster: some developer starts a parallel make -j10 and
> then someone starts up a long-running weather simulation.. It could end up
> running on only 6 of the computers, with no hope of moving them.. At least
> with migration you would have a chance of moving them..

Some very knowledgeable people are saying "you don't need to try". I
disagree. With the current software "base" we can get reasonably far.

Think about a kernel compile: 10 minutes on 1 CPU. A quick count
shows about 300 objects, so about 300 "gcc" jobs. That makes two
seconds each. If I've already started "gcc", moving it should take
significantly less than those two seconds. You need 100 Mbps Ethernet
for that. This is a "hard" case to do right.
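To put numbers on the two-second budget, here is a back-of-the-envelope sketch. The 10-minute build and the 300 objects come from the text above; the 2 MB image size for one gcc process is an assumption chosen only for illustration, and the wire time ignores protocol overhead and latency.

```python
# Figures from the post: a 10-minute build with about 300 gcc jobs.
KERNEL_BUILD_SECONDS = 10 * 60
OBJECT_FILES = 300
# Assumption for illustration only: resident image of one gcc process.
IMAGE_BYTES = 2 * 1024 * 1024


def seconds_per_job(total_s: float, jobs: int) -> float:
    """Average wall-clock time of one job."""
    return total_s / jobs


def transfer_seconds(image_bytes: int, link_mbps: float) -> float:
    """Raw wire time to ship the image, ignoring overhead and latency."""
    return image_bytes * 8 / (link_mbps * 1_000_000)


if __name__ == "__main__":
    budget = seconds_per_job(KERNEL_BUILD_SECONDS, OBJECT_FILES)
    for mbps in (10, 100):
        t = transfer_seconds(IMAGE_BYTES, mbps)
        print(f"{mbps:>3} Mbps: {t:.2f} s to move, {t / budget:.0%} of a {budget:.0f} s job")
```

Under these assumptions a 10 Mbps link spends about 1.7 s (roughly 84% of the job) just moving the image, while 100 Mbps brings that down to about 0.17 s (roughly 8%).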

The next question is: how do you know which programs are likely to run
long? One way is to gather profiling info about programs. "ls" is
likely to be a "short" program. gcc can be expected to run from a tenth
of a second to several seconds. povray can be expected to run for hours.
This should lead to "hints" about these programs. Migration should only
be done if you expect to win from it: you move a program off a node
only when you expect the benefit (not sharing the CPU with another
CPU-intensive program) to outweigh the cost (CPU and I/O resources
spent moving the process).
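The rule "move only if the benefit outweighs the cost" can be written down as a small decision function. This is a sketch under simplifying assumptions, not a worked-out policy: the hints table is hypothetical (its numbers merely echo the ls/gcc/povray examples), a hint is treated as the remaining work, the migration cost is folded into a single number of seconds, and a CPU shared by n CPU-bound processes is assumed to give each of them 1/n of its time.

```python
from typing import Optional

# Hypothetical profiling hints: expected CPU seconds per run.
# Illustrative values echoing the ls / gcc / povray examples.
RUNTIME_HINTS = {
    "ls": 0.01,
    "gcc": 2.0,
    "povray": 4 * 3600.0,
}


def expected_gain(remaining_cpu_s: float, migration_cost_s: float, sharers: int) -> float:
    """Wall-clock seconds saved by moving a process to an idle node.

    Staying: `sharers` CPU-bound processes (this one included) split the CPU,
    so the remaining work takes remaining_cpu_s * sharers seconds.
    Moving: pay the migration cost once, then run alone on the new node.
    """
    stay = remaining_cpu_s * sharers
    move = migration_cost_s + remaining_cpu_s
    return stay - move


def should_migrate(program: str, migration_cost_s: float, sharers: int = 2,
                   hints: Optional[dict] = None) -> bool:
    """Migrate only when the expected gain is positive."""
    table = RUNTIME_HINTS if hints is None else hints
    remaining = table.get(program)
    if remaining is None:
        return False  # no profile for this program: stay put
    return expected_gain(remaining, migration_cost_s, sharers) > 0
```

With two CPU-bound processes on a node, the break-even point is a migration cost equal to the remaining runtime, which is why a two-second gcc needs a fast network while povray can afford a slow one.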

Even without profiling, something can already be done: a process that
has been running for longer than 5 seconds is likely to keep running
for another 5 seconds. Sure, there is a chance it will exit a hundredth
of a second after the "move", but that's unlikely.
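On a Linux system the age test needs no profiling at all: /proc/<pid>/stat records each process's start time in clock ticks since boot, and /proc/uptime gives the current time since boot. The sketch below is Linux-specific; the 5-second threshold is the one from the text, while the function names are hypothetical. The parsing is kept in a pure function so it can be checked without a live process.

```python
import os

MIN_AGE_S = 5.0  # "running for longer than 5 seconds"


def age_from_proc(stat_text: str, uptime_text: str, clk_tck: int) -> float:
    """Process age in seconds from /proc/<pid>/stat and /proc/uptime text."""
    # The command name (field 2) is in parentheses and may contain spaces,
    # so split only what follows the last ')'. The first token there is
    # field 3 (state); starttime is field 22, i.e. index 19.
    rest = stat_text[stat_text.rindex(")") + 2:].split()
    start_ticks = int(rest[19])
    uptime_s = float(uptime_text.split()[0])
    return uptime_s - start_ticks / clk_tck


def process_age_seconds(pid: int) -> float:
    """Age of a live process (Linux only)."""
    with open(f"/proc/{pid}/stat") as f:
        stat_text = f.read()
    with open("/proc/uptime") as f:
        uptime_text = f.read()
    return age_from_proc(stat_text, uptime_text, os.sysconf("SC_CLK_TCK"))


def old_enough_to_move(age_s: float, min_age_s: float = MIN_AGE_S) -> bool:
    """Lifetime heuristic: a process that has already run min_age_s seconds
    is assumed likely to run about that long again."""
    return age_s >= min_age_s
```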

Measurements of I/O traffic can also be taken into account. A process
reading 3Mb per second out of a local file should not be migrated.
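An I/O veto can sample a per-process byte counter twice and compare the rate against a limit. This sketch assumes a modern Linux kernel, which exposes cumulative counters in /proc/<pid>/io (reading another user's process needs privileges). It uses the rchar counter, which counts all bytes read, including from sockets and pipes, so it only approximates the "reading from a local file" case in the text. The 3 MB/s limit reads "3Mb" as megabytes, which is itself an assumption, and the function names are hypothetical.

```python
import time

IO_LIMIT_BYTES_PER_S = 3 * 1024 * 1024  # assumed reading of "3Mb per second"


def parse_proc_io(text: str) -> dict:
    """Turn 'key: value' lines from /proc/<pid>/io into a dict of ints."""
    counters = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            counters[key.strip()] = int(value)
    return counters


def read_rate(before: int, after: int, interval_s: float) -> float:
    """Bytes per second between two samples of a cumulative counter."""
    return (after - before) / interval_s


def io_allows_migration(rate_bytes_per_s: float,
                        limit: float = IO_LIMIT_BYTES_PER_S) -> bool:
    """Heavy readers stay where their data is."""
    return rate_bytes_per_s < limit


def sample_read_rate(pid: int, interval_s: float = 1.0) -> float:
    """Bytes read per second by `pid` over `interval_s` (Linux only)."""
    def rchar() -> int:
        with open(f"/proc/{pid}/io") as f:
            return parse_proc_io(f.read())["rchar"]

    before = rchar()
    time.sleep(interval_s)
    return read_rate(before, rchar(), interval_s)
```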

Now the technicalities.

TCP connections. Hmm. Masquerading? Someone suggested having a cluster
behind a Linux-router/firewall, but why not:

process donating machine             process accepting machine

Tell acceptor that a process
is coming.
Describe fd's; one of them is
a socket.
                                     Open a random network socket.
                                     Tell donator the IP/port number
                                     of the new socket.
Set up a masquerading entry for
the local socket to redirect it
to the remote socket. (Don't
forget to send whatever is
already in the buffer.)
                                     Process starts executing here.
                                     Pages are not yet transferred.
                                     Page faults are treated as
                                     "remote swap" and served from
                                     the donating machine.
                                     Alternatively, the whole memory
                                     image is transferred.
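The "remote swap" step amounts to demand paging across the network. The toy model below (all class names hypothetical, no real networking or kernel involvement) shows the bookkeeping on the accepting side: pages start out absent, the first touch of a page "faults" and fetches it from the donor, and later touches are served locally. The eager alternative, transferring the whole image, is the prefetch_all method.

```python
PAGE_SIZE = 4096


class DonorMemory:
    """Stands in for the donating machine, which still holds the pages."""

    def __init__(self, pages):
        self.pages = pages      # page number -> bytes of length PAGE_SIZE
        self.requests = 0       # pages shipped over the "network" so far

    def fetch(self, page_no):
        self.requests += 1
        return self.pages[page_no]


class MigratedMemory:
    """Accepting side: the process runs before its memory has arrived."""

    def __init__(self, donor):
        self.donor = donor
        self.resident = {}      # pages already copied to this machine

    def read_byte(self, addr):
        page_no, offset = divmod(addr, PAGE_SIZE)
        if page_no not in self.resident:        # "page fault"
            self.resident[page_no] = self.donor.fetch(page_no)
        return self.resident[page_no][offset]

    def prefetch_all(self):
        """Alternative: transfer the whole image instead of faulting it in."""
        for page_no in self.donor.pages:
            if page_no not in self.resident:
                self.resident[page_no] = self.donor.fetch(page_no)
```

Faulting pages in lets the process resume at once and never ships untouched pages, but every first touch waits a network round trip and the donor has to stay up until the image has drained; copying everything up front trades that for a longer pause before the process restarts.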

Processes owning fd's to local devices might simply be locked to that
machine. ("Sorry: that tar keeps running on the tape server")
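Finding processes that should stay pinned can start from the file descriptor table. The Linux-specific sketch below walks /proc/<pid>/fd and flags descriptors that resolve to character or block devices. It is deliberately crude: terminals are character devices too, so a small allowlist (hypothetical; /dev/null and /dev/zero here) stands in for the policy a real implementation would need.

```python
import os
import stat

# Hypothetical allowlist: devices that behave the same on every node.
HARMLESS_DEVICES = {"/dev/null", "/dev/zero"}


def is_device_mode(mode: int) -> bool:
    """True for character or block special files."""
    return stat.S_ISCHR(mode) or stat.S_ISBLK(mode)


def pinning_fds(pid: int) -> list:
    """Paths of device fds that would tie `pid` to this machine (Linux only)."""
    fd_dir = f"/proc/{pid}/fd"
    pinned = []
    for name in os.listdir(fd_dir):
        link = os.path.join(fd_dir, name)
        try:
            target = os.readlink(link)
            mode = os.stat(link).st_mode    # follows the link to the open file
        except OSError:
            continue                        # fd closed while we were looking
        if is_device_mode(mode) and target not in HARMLESS_DEVICES:
            pinned.append(target)
    return pinned


def can_migrate(pid: int) -> bool:
    """A process with no pinning device fds is a migration candidate."""
    return not pinning_fds(pid)
```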

I'd suggest that we require the hosts to have the same filesystem view.
(A local disk on hosta needs to be mounted at the same mount point on
hostb as an NFS disk.)
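Whether two hosts really share the same filesystem view can be checked, at least superficially, by comparing mount points. The sketch below parses /proc/mounts-style text (device, mount point, type, options, ...) from each host and lists mount points one host has and the other lacks. It deliberately ignores the filesystem type, since a local disk on one host may appear as an NFS mount on the other. How the two files are collected is left open, and the function names are hypothetical.

```python
def mount_points(proc_mounts_text: str) -> set:
    """Mount points (second field) from /proc/mounts-style text.

    The kernel escapes spaces in paths as \\040; this sketch leaves such
    escapes as they are, which is fine when comparing two hosts.
    """
    points = set()
    for line in proc_mounts_text.splitlines():
        fields = line.split()
        if len(fields) >= 2:
            points.add(fields[1])
    return points


def missing_mounts(reference_text: str, candidate_text: str) -> list:
    """Mount points present on the reference host but absent on the candidate."""
    return sorted(mount_points(reference_text) - mount_points(candidate_text))


def same_view(a_text: str, b_text: str) -> bool:
    """True when both hosts have exactly the same set of mount points."""
    return not missing_mounts(a_text, b_text) and not missing_mounts(b_text, a_text)
```

For example, hosta mounting its local /dev/hda3 on /home and hostb mounting hosta:/home on /home over NFS count as the same view.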

I probably missed a lot of issues. But implementing this is not all
that hard.

Roger.

-- 
** R.E.Wolff@BitWizard.nl ** +31-15-2137555 ** http://www.BitWizard.nl/ **