Some very knowledgeable people are saying "you don't need to try". I
disagree. With the current software "base" we can get reasonably far.
Think about a kernel compile: 10 minutes on 1 CPU. Now a quick count
shows about 300 object files, so about 300 "gcc" jobs. That makes two
seconds each. If I've already started "gcc", moving it should take
significantly less than those two seconds. 100 Mbps Ethernet is enough
for that. This is a "hard" case to do right.
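A quick back-of-the-envelope check of those numbers (the gcc image size
here is my assumption, not from the measurement above):

```python
# Per-job time for the kernel compile described above.
compile_seconds = 10 * 60          # 10-minute kernel compile
jobs = 300                         # roughly 300 object files
per_job = compile_seconds / jobs   # seconds per "gcc" job
assert per_job == 2.0

# Assumed numbers: a gcc process image of ~4 MB, and 100 Mbps
# Ethernet giving ~12.5 MB/s of raw throughput.
image_mb = 4.0
link_mb_per_s = 100 / 8            # 100 Mbps ~= 12.5 MB/s
move_seconds = image_mb / link_mb_per_s
print(per_job, move_seconds)       # 2.0 0.32
```

So under these assumptions the move costs well under the two seconds the
job itself takes, which is what makes migration plausible at all.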
The next thing is: how do you know which programs are likely to run for
a long time? One way is to gather profiling info about programs. "ls" is
likely to be a "short" program. gcc can be expected to run from a tenth
of a second to several seconds. povray can be expected to run for hours.
This should lead to "hints" about these programs. Migration should only
be done if you expect to win from it. So you move a program off a node
only when you expect the benefit (not sharing the CPU with another
CPU-intensive program) to outweigh the cost (CPU and IO resources
spent moving the process).
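As a sketch of that cost/benefit rule (the function name and the crude
"runtime stretches by the number of CPU-hungry processes" model are my
assumptions, not anything specified above):

```python
# Hypothetical sketch: migrate only when the expected win outweighs
# the cost of the move.
def should_migrate(expected_runtime_s, move_cost_s, local_load, remote_load):
    """Migrate if the CPU time regained by escaping contention on the
    busy node exceeds the time spent moving the process."""
    if remote_load >= local_load:
        return False               # no less-loaded node to escape to
    # Crude model: with N CPU-hungry processes on a node you get 1/N
    # of a CPU, so runtime stretches by a factor of N.
    time_here = expected_runtime_s * local_load
    time_there = expected_runtime_s * remote_load + move_cost_s
    return time_there < time_here

# A povray-like job (hours of CPU) is trivially worth a 2 s move;
# an ls-like job is far too short to pay for it.
print(should_migrate(3600, 2.0, local_load=2, remote_load=1))  # True
print(should_migrate(0.05, 2.0, local_load=2, remote_load=1))  # False
```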
Without resorting to profiling, there is already something that can be
done: a process that has been running for longer than 5 seconds is
likely to remain running for another 5 seconds. Sure, chances are it
will exit in the hundredth of a second after the "move", but that's
unlikely. Measurements of IO traffic can also be taken into account: a
process reading 3 MB per second out of a local file should not be
migrated.
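Those two profile-free rules can be sketched as a single filter (the
function name and exact thresholds are illustrative assumptions):

```python
# Sketch of the profile-free heuristics above: "has run 5 s, will
# likely run 5 more", plus a local-IO-traffic cutoff.
def migration_candidate(cpu_seconds_so_far, local_io_bytes_per_s):
    MIN_RUNTIME = 5.0              # seconds already spent running
    MAX_LOCAL_IO = 3 * 1024**2     # ~3 MB/s from local files
    if cpu_seconds_so_far < MIN_RUNTIME:
        return False               # probably a short-lived process
    if local_io_bytes_per_s >= MAX_LOCAL_IO:
        return False               # too much local-file traffic to move
    return True

print(migration_candidate(12.0, 50_000))       # True: long-running, little IO
print(migration_candidate(0.3, 0))             # False: too young
print(migration_candidate(30.0, 4 * 1024**2))  # False: heavy local IO
```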
Now the technicalities.
TCP connections. Hmm. Masquerading? Someone suggested putting the
cluster behind a Linux router/firewall, but why not:
process                              process
donating                             accepting
machine                              machine

Tell acceptor that a process
is coming.

Describe fd's, one of them is
a socket.
                                     Open a random network socket.

                                     Tell donator the ip/portno of the
                                     new socket.

Set up a masquerading entry for
the local socket to redirect it
to the remote socket. (Don't forget
to send whatever already is in
the buffer.)
                                     Process starts executing here.
                                     Pages are not yet transferred.
                                     Pagefaults are treated as
                                     "remote swap", and gotten from
                                     the donating machine.

                                     Alternatively the whole memory
                                     image is transferred.
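The "open a random network socket, tell donator the ip/portno" step on
the accepting side is easy to sketch (a minimal user-space illustration,
not the actual kernel mechanism; the function name is mine):

```python
# Accepting-machine side of the handshake above: open a socket on a
# kernel-chosen ("random") port, and recover the ip/portno that must
# be reported back to the donating machine.
import socket

def open_accept_socket():
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.bind(("127.0.0.1", 0))       # port 0: let the kernel pick one
    s.listen(1)
    host, port = s.getsockname()   # this (ip, portno) goes to the donor
    return s, port

sock, port = open_accept_socket()
print(port > 0)                    # an actual ephemeral port was chosen
sock.close()
```

The donor would then point its masquerading entry at that (ip, portno)
pair and flush any data still buffered on the old socket.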
Processes owning fd's to local devices might simply be locked to that
machine. ("Sorry: that tar keeps running on the tape server")
I'd suggest that we require the hosts to have the same filesystem view
(a local disk on hosta needs to be mounted on the same mountpoint on
hostb as an NFS disk).
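Concretely, that "same filesystem view" requirement might look like this
(hostnames, device, and mountpoint are made-up examples):

```
# /etc/fstab on hosta (the machine with the local disk):
/dev/sda3        /data    ext2    defaults    0 2

# /etc/fstab on hostb (same mountpoint, reached over NFS):
hosta:/data      /data    nfs     defaults    0 0
```

A migrated process that has /data/foo open then resolves the same path
to the same file on either host.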
I probably missed a lot of issues. But implementing this is not all
that hard.
Roger.
--
** R.E.Wolff@BitWizard.nl ** +31-15-2137555 ** http://www.BitWizard.nl/ **
Florida -- A 39 year old construction worker woke up this morning when a
109-car freight train drove over him. According to the police the man was
drunk. The man himself claims he slipped while walking the dog. 080897