Re: Process Migration on Linux - Impossible?
kwrohrer@enteract.com
Thu, 2 Oct 1997 02:09:46 -0500 (CDT)
And lo, Rogier Wolff saith unto me:
[snip...]
> > cluster, and you start 4 rc5 daemons, then they are stuck on the other two
> > boxes.. RC5 isn't a good example because it can easily be restarted.. How
> > about a 16node cluster, some developer starts a parallel make -j10 and
> > then someone starts up a long running weather simulation.. It could end up
> > only running on 6 of the computers, with no hope of moving them.. At least
> > with migration you would have a chance of moving them..
>
> Some very knowledgeable people are saying "you don't need to try". I
> disagree. With current software "base" we can get reasonably far.
>
> Think about a kernel compile. 10 minutes, 1 CPU. Now a quick count
> shows about 300 objects, so about 300 "gcc" jobs. That makes two
> seconds each. If I've started "gcc" already, moving it should take
> significantly less than those two seconds. You need 100mbps ethernet
> for that. This is a "hard" case to do right.
These gcc's should be distributed before they even start. Parallel
make is not really a good candidate for migration, given modern code
file sizes and modern compiler speeds.
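To put rough numbers on that: a back-of-envelope sketch using only the
figures quoted above (10 minutes, about 300 objects, 100 Mbps). The
8 MiB image size is purely an assumed value for illustration, and the
link rate is the nominal one with no protocol overhead.

```python
# Figures taken from the quoted message; nothing here is measured.
COMPILE_SECONDS = 10 * 60        # one kernel build on one CPU
OBJECT_COUNT = 300               # rough count of gcc jobs
LINK_BITS_PER_SECOND = 100e6     # nominal 100 Mbps Ethernet

# Average lifetime of one gcc job.
seconds_per_job = COMPILE_SECONDS / OBJECT_COUNT

# Upper bound on what the link can carry during one job's lifetime.
max_bytes_in_job_time = LINK_BITS_PER_SECOND / 8 * seconds_per_job


def transfer_seconds(image_bytes, link_bps=LINK_BITS_PER_SECOND):
    """Time to push a process image over the link at the nominal rate."""
    return image_bytes * 8 / link_bps


if __name__ == "__main__":
    print(seconds_per_job)                      # 2.0 seconds per job
    print(max_bytes_in_job_time / 1e6)          # 25.0 MB at most
    # Hypothetical 8 MiB gcc image (assumed size, not from the thread).
    print(transfer_seconds(8 * 1024 * 1024))
```

Even at the nominal rate, moving an already-running gcc eats a large
slice of its two-second lifetime, which is why placing it before it
starts looks like the better deal.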
> The next thing is: How do you know which programs are likely to take
> long? One way is to gather profiling info about programs. "ls" is
> likely to be a "short" program. gcc can be expected to run from 1/10th
> to several seconds. povray can be expected to run for hours. This
> should lead to "hints" about these programs. Migration should only be
> done if you expect to win from it. So you move a program off a node
> only when you expect the benefit (not sharing the cpu with another
> CPU-intensive program) will outweigh the cost (CPU & IO resources
> spent moving the process).
Hints about programs aside, after a few seconds of consistent CPU
burst (or nearly consistent) you know that it's probably going to
do more CPU burst for at least a while. Temporal locality and all
that.
> Without resorting to the profiling, there is already something that
> can be done: a process running for longer than 5 seconds is likely
> to remain running for another 5 seconds. Sure the chances are it will
> exit in the 100th of a second after the "move", but that's unlikely.
Yup. Likewise, the process starting it up can request that it prefer
to start remotely.
> Measures about IO traffic can also be taken into account. A process
> reading 3Mb per second out of a local file should not be migrated.
Then it won't be exceeding most of its timeslices, unless it's reading
from /dev/zero (in which case it's not really doing I/O at all).
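The heuristics discussed so far (consistent CPU burst, "ran 5 seconds,
will probably run 5 more", leave heavy local readers alone, only move
when you expect to win) can be sketched roughly as below. Every name
and threshold here (ProcStats, should_migrate, the 1 MB/s I/O cutoff,
the 50% local CPU share) is an assumption made up for illustration,
not anything an existing kernel provides.

```python
from dataclasses import dataclass


@dataclass
class ProcStats:
    cpu_burst_seconds: float       # how long it has been CPU-bound so far
    local_io_bytes_per_sec: float  # observed traffic to local files
    image_bytes: int               # memory that would have to be moved


def should_migrate(p, link_bytes_per_sec=12.5e6, min_burst=5.0,
                   max_local_io=1e6, local_cpu_share=0.5):
    """Return True if moving the process is expected to pay off."""
    # Short-lived or bursty processes are not worth the trouble.
    if p.cpu_burst_seconds < min_burst:
        return False
    # Heavy local readers stay next to their disk.
    if p.local_io_bytes_per_sec > max_local_io:
        return False
    # Temporal locality: guess it runs about as long again as it has.
    expected_remaining = p.cpu_burst_seconds
    # Staying means sharing the CPU; moving means paying for transfer.
    stay_seconds = expected_remaining / local_cpu_share
    move_seconds = p.image_bytes / link_bytes_per_sec + expected_remaining
    return move_seconds < stay_seconds


if __name__ == "__main__":
    print(should_migrate(ProcStats(10.0, 0.0, 8_000_000)))
    print(should_migrate(ProcStats(1.0, 0.0, 8_000_000)))
```

A real policy would also want to look at load on the accepting node;
this only captures the cost/benefit comparison from the quoted text.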
> Now the technicalities.
>
> TCP connections. Hmm. Masquerading? Someone suggested having a cluster
> behind a Linux-router/firewall, but why not:
At least the user-level solution would use proxies on the donor
machine, acceptor machine, and remote machine(s) (or router/bridge/
gateway, depending):
> process process
> donating accepting
> machine machine
>
> Tell acceptor that a process
> is coming.
> Describe fd's, one of them is
> a socket. Open a random network socket.
> tell donator the ip/portno of the
> new socket.
> Set up a masquerading entry for
> the local socket to redirect it
> to the remote socket. (Don't forget
> to send whatever already is in
> the buffer.)
Forward the buffer, but instead of
masquerading, tell the sender's proxy
the new location. You can get net-
unique "socket" numbers, btw, without
any central coordination save once per
cluster-connect operation...
> Process starts executing here.
> Pages are not yet transferred.
I would suggest prefaulting the current page, plus half the standard
fault-ahead in either direction. You could even get tricky and forward
all pages which seem pointed to by registers...
> Pagefaults are treated as
> "remote swap", and gotten from
> the donating machine.
> Alternatively the whole memory
> image is transferred.
I really like the netswap idea here; see my other message for the
rough idea.
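A minimal sketch of the prefault selection suggested above: the page
holding the program counter, half a fault-ahead window on either side,
and any pages that register values seem to point into. The page size,
the fault-ahead value of 16, and the function name are assumptions for
illustration only.

```python
PAGE_SIZE = 4096  # assumed page size


def prefault_pages(pc, registers=(), fault_ahead=16, mapped=None):
    """Pick page numbers to send along with the process before it runs.

    pc          -- current program counter (an address)
    registers   -- register values that may or may not be pointers
    fault_ahead -- assumed "standard" fault-ahead window, in pages
    mapped      -- optional set of page numbers the process really has
    """
    half = fault_ahead // 2
    current = pc // PAGE_SIZE
    # Current page plus half the window in either direction.
    pages = set(range(max(0, current - half), current + half + 1))
    # Treat every register as a possible pointer.
    pages.update(value // PAGE_SIZE for value in registers)
    # Registers holding plain integers will name unmapped pages; drop them.
    if mapped is not None:
        pages &= set(mapped)
    return sorted(pages)


if __name__ == "__main__":
    print(prefault_pages(0x8048010, registers=[0xBFFFF000], fault_ahead=4))
```

Everything not in that set would then arrive through the remote-swap
page faults described in the quoted text.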
> Processes owning fd's to local devices might simply be locked to that
> machine. ("Sorry: that tar keeps running on the tape server")
Dunno about tar; depends on the quality of your net and your tape.
Certainly backing up a remote disk to tape is comparable to backing
up a local disk to a remote tape... Migrating the X server, on the
other hand...
> I'd suggest that we require the hosts to have the same filesystem view.
> (a local disk on hosta needs to be mounted on the same mountpoint on
> hostb as a NFS disk).
File accesses can be forwarded if need be, but assuming a common
cluster filesystem until other things are mostly working is okay
too. But let's not forget semaphores, shared memory, pipes, ...
Most of these, apart from shared memory, can be handled like a socket.
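For the socket-style handling, here is a rough sketch of the proxy
idea: each node hands out net-unique endpoint numbers from its own
node id plus a local counter (so the only coordination is getting a
node id once at cluster-connect time), and a proxy keeps a table of
where each endpoint currently lives. The class and method names are
hypothetical, and real proxies would of course talk over the network
rather than share a dictionary.

```python
import itertools


class EndpointAllocator:
    """Hands out net-unique endpoint ids without central coordination."""

    def __init__(self, node_id):
        # node_id is assumed to be assigned once, at cluster-connect.
        self.node_id = node_id
        self._counter = itertools.count(1)

    def new_id(self):
        # Unique across the cluster because node ids are unique.
        return (self.node_id, next(self._counter))


class Proxy:
    """Tracks which node currently hosts each endpoint."""

    def __init__(self):
        self.location = {}

    def register(self, endpoint, node):
        self.location[endpoint] = node

    def migrate(self, endpoint, new_node, buffered=b""):
        # Record the new home; the caller forwards any buffered data.
        self.location[endpoint] = new_node
        return new_node, buffered

    def route(self, endpoint):
        return self.location[endpoint]


if __name__ == "__main__":
    alloc = EndpointAllocator(node_id=3)
    proxy = Proxy()
    ep = alloc.new_id()
    proxy.register(ep, 3)
    proxy.migrate(ep, 7, buffered=b"pending bytes")
    print(ep, "now lives on node", proxy.route(ep))
```

Pipes and semaphores could plausibly be given endpoint ids the same
way; shared memory is the one that does not fit this model.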
Keith