Re: MOSIX and kernel mods.

Jelle Foks (jelle@flying.demon.nl)
Sun, 7 Mar 1999 03:39:19 +0100 (CET)


Let me butt in on this.

On Fri, 5 Mar 1999, Richard Gooch wrote:

> Simon Kenyon writes:
> > On 05-Mar-99 Richard Gooch wrote:
> > > I can't think of any CS invention that is used in the real world that
> > > has cost 4 orders of magnitude in performance.
> >
> > paging to/from disk
>
> No, that's not a fair comparison. Paging/swapping is the only way to
> extend virtual memory space without buying more RAM. Is there an
> alternative method of increasing VM size without paging, and is it any
> faster?
> Also, typical programmes don't suffer much performance loss with
> paging.

Paging is not the only way. Compressed RAM is another, and yes,
compression and decompression of RAM can be much faster than disk I/O
(small/quick estimate: on a 450MHz machine you have (after tax?) >40
cycles per byte to get 10MB/s performance. Implementing a compression
algorithm that uses avg 40 cycles per byte should not be impossible, and
the loss of performance due to the CPU not being available [for other
processes while a page is being (de)compressed] could in many cases very
well be outweighed by the >10ms seek times+access latency penalties that
paging on hard drives gives you).
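
To make that estimate concrete, here is the arithmetic as a small Python
sketch. The 4096-byte page size, decimal megabytes, and the flat 40
cycles/byte compressor cost are my assumptions for illustration, not
measurements of any real algorithm.

```python
# Back-of-envelope check of the compressed-RAM estimate.
# ASSUMPTIONS (illustrative only): 4096-byte pages, decimal MB, and a
# hypothetical compressor that costs a flat 40 cycles per byte.

CLOCK_HZ = 450e6               # 450 MHz CPU
TARGET_BYTES_PER_S = 10e6      # the 10 MB/s target from the estimate
PAGE_BYTES = 4096              # assumed page size
COMPRESS_CYCLES_PER_BYTE = 40  # assumed average compressor cost
DISK_LATENCY_S = 0.010         # the >10 ms seek + access latency


def cycle_budget_per_byte(clock_hz, bytes_per_s):
    """Cycles available per byte if the CPU must sustain bytes_per_s."""
    return clock_hz / bytes_per_s


def cpu_seconds_per_page(page_bytes, cycles_per_byte, clock_hz):
    """CPU time spent compressing (or decompressing) one page."""
    return page_bytes * cycles_per_byte / clock_hz


budget = cycle_budget_per_byte(CLOCK_HZ, TARGET_BYTES_PER_S)
page_cost_s = cpu_seconds_per_page(PAGE_BYTES, COMPRESS_CYCLES_PER_BYTE, CLOCK_HZ)

print(f"cycle budget:       {budget:.0f} cycles/byte")
print(f"CPU time per page:  {page_cost_s * 1e3:.2f} ms")
print(f"disk latency alone: {DISK_LATENCY_S * 1e3:.0f} ms")
```

Under those assumptions the budget is 45 cycles per byte, and one page
costs about 0.36 ms of CPU time, against 10 ms or more of disk latency
before any data is transferred.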

> DSM is not the only solution to parallel/distributed computing. DSM
> can destroy performance for typical threaded programmes. Typical MPI
> programmes will work fairly well on a cluster.

From a user's point of view: Transparent clustering is a definite yes-yes.
What I'm thinking about is multi-purpose servers, on which there are many
processes (or groups of them) that need little or no IPC. For example an
application where multiple users (>50) use the same server for various
tasks. Now we'd like to be able to deal with an average of 20 users
running heavy tasks (simulations, etc), without the server grinding down
to a near-halt under a load of over 20. Process migration would be a great
help here, where the simulations are dynamically moved to other systems.
Without process migration, we'd have to tell the users to 'before you
start a simulation, first look for an unloaded server, log on there and
start your simulation there', or we'd have to tell the users 'go tell the
software companies to make special cluster-versions of the program for you
[without charging you more for it, and without letting you wait months for
the regression-tested release first]'.

In my opinion it's the OS's job to determine how much IPC there is between
processes or threads and then to decide whether or not those processes
should be placed on the same processor, cluster, or even WAN-location. The
more this can be done transparently to the applications, the more actually
available applications will be able to make use of clusters.

The IPC latencies, and their differences between single processor, SMP,
and network clusters are just the parameters for optimization of the
choice on which CPU the process should run. If you make a distinction
where you explicitly say 'SHM is fast, IPC over net is slow', then you
forget that the 'fast' or 'slow'-ness of data transfer is relative to the
available CPU power and available network speed (which varies for each
cluster, and which varies over time).

Consider the case when the CPUs in a cluster are heavily loaded by
processes that need little communication and there are still ample gigabits
of network bandwidth left. In that case, the network latencies become much
less important (this is for the same reasons why seek-latencies of
hard-disks are less important on multitasking machines). If we have an
application that assumes that network-based IPC is relatively expensive,
we get a suboptimum in this case. Hence we don't want application
programmers to optimize for specific hardware and load configurations.

What would be great is if the process migration implementation could catch
the IPC (whether it is SHM or not), profile it, and based on the actual
situation decide (or revise a previously made decision) on which CPU to
run the processes.
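
A minimal sketch of what such a decision could look like, in Python. This
is not how MOSIX or any existing scheduler does it; the function names,
the cost model (CPU load plus profiled IPC volume times a link cost), and
all the numbers are hypothetical, chosen only to show that the same
process can land on a different node when the relative cost of the
network changes.

```python
# Hypothetical placement heuristic: pick the node where a process is
# cheapest to run, given current CPU load and profiled IPC volume.
# All names, units, and weights are illustrative assumptions.

def placement_cost(node, proc, placement, cpu_load, ipc_volume, link_cost):
    """Estimated cost of running `proc` on `node`.

    placement:  dict mapping already-placed process -> node
    cpu_load:   dict mapping node -> current load (arbitrary units)
    ipc_volume: dict mapping (proc_a, proc_b) -> profiled IPC volume
    link_cost:  cost per unit of IPC that has to cross the network
    """
    comm = 0.0
    for (a, b), volume in ipc_volume.items():
        if proc not in (a, b):
            continue
        peer = b if a == proc else a
        peer_node = placement.get(peer)
        # IPC only costs extra when the peer lives on another node.
        if peer_node is not None and peer_node != node:
            comm += volume * link_cost
    return cpu_load[node] + comm


def choose_node(proc, nodes, placement, cpu_load, ipc_volume, link_cost):
    """Return the node with the lowest estimated cost for `proc`."""
    return min(
        nodes,
        key=lambda n: placement_cost(n, proc, placement, cpu_load,
                                     ipc_volume, link_cost),
    )


nodes = ["A", "B"]
cpu_load = {"A": 20.0, "B": 1.0}   # A is the overloaded server
placement = {"db": "A"}            # the peer process already runs on A
ipc_volume = {("sim", "db"): 5.0}  # profiled traffic between sim and db

# Cheap network: moving the simulation away from the loaded node wins.
cheap_net = choose_node("sim", nodes, placement, cpu_load, ipc_volume, 1.0)
# Expensive network: staying next to the peer wins despite the load.
costly_net = choose_node("sim", nodes, placement, cpu_load, ipc_volume, 10.0)
print(cheap_net, costly_net)
```

Re-running choose_node whenever the profiled load or IPC figures change
is one way to revise an earlier decision; a real implementation would
also have to account for the cost of the migration itself.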

This boils down to having a single process scheduler per cluster that works
from the single CPU level up to the network level, which schedules all
processes in the cluster across all resources in the cluster. We've not
had user-space processes do their own scheduling on single CPU or SMP
systems, so why start doing that in clusters?

Jelle.
