Re: Let's get this right (WAS RE: MOSIX and kernel mods)

Rogier Wolff (R.E.Wolff@BitWizard.nl)
Sun, 7 Mar 1999 12:07:35 +0100 (MET)


Richard Gooch wrote:

[Good "rant" about why networking is lagging behind processors
in terms of speed, and that the gap is NOT closing, but rather expanding]

> Let me make a bottom-line statement: networking will *never* be as
> fast as internal computer speeds. Technology trends point to computers
> increasing at a greater rate. And then there's physics. Unless you're
> planning on breaking the speed of light, it's always going to take a
> lot longer to send information across a room than across a chip.

On the other hand, if the software can make good use of certain
features, the hardware manufacturers will in turn see that it benefits
them to provide more of those features.

In the case at hand, if Linux allows us to build a cluster of machines
that, transparently to the user, handles a workload much larger than
one machine could, then faster, lower-latency networking products will
become more and more profitable to produce and sell.

Suppose you want to make a distributed shared memory machine. What
"far-memory-latency" would be acceptable? Inside the machine we
currently achieve:

  primary cache:      0.005 us   (2 cycles at 400 MHz?)
  secondary cache:     0.02 us
  main memory:          0.1 us
  disk:               12000 us   (6 ms avg rotational latency +
                                  6 ms avg seek time)

This clearly demonstrates that you DON'T want to start paging...

OK, now let's assume that the "further away" latency penalty is
allowed to be a factor of 10 over main memory. That gives us a 1 us
timeframe. With "current" technology (SCSI, 80 MB/sec) we can transfer
80 bytes in that time.
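
For concreteness, that byte budget is just bandwidth times the allowed
latency. A throwaway sketch, using only the numbers above:

  #include <stdio.h>

  int main(void)
  {
      double bandwidth_Bps = 80e6;  /* "current" SCSI: 80 MB/sec */
      double budget_us     = 1.0;   /* allowed far-memory latency */

      /* bytes we can move within the latency budget */
      double bytes = bandwidth_Bps * budget_us / 1e6;
      printf("%.0f bytes in %.1f us\n", bytes, budget_us);
      return 0;
  }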

Enough to tell the other side "please get me (sub-)page xyz". Current
interrupt latencies and software limitations don't allow us to do it
that fast. But those can and will be improved if they become an issue.
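
For scale, here is a made-up wire format for that request (none of
these names exist anywhere); it needs only a fraction of the 80 bytes:

  #include <stdint.h>

  /* hypothetical "please get me (sub-)page xyz" request:
     16 bytes, a fifth of the budget computed above */
  struct dsm_page_req {
      uint32_t node_id;    /* requesting node */
      uint32_t asid;       /* address space the fault occurred in */
      uint64_t page_addr;  /* faulting (sub-)page address */
  };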

One microsecond is also enough for light to travel 300 meters. So at
those latencies, "across the room" is not yet a problem. Not even
close. Electrical signals travel at about a third of the speed of
light, so that doesn't pose any trouble either.

However, having a bit too little memory already pushes you to swap,
and that costs MUCH more than a single microsecond. Suppose a task
with 64 MB of VM needs to be considered for migration. On an 80 MB/sec
network, you want to win back at least the transfer time. That comes
to about one second. If the application has already been running for a
few minutes, and you can assign it a, say, 20% faster CPU, that is
likely to be worth the trouble... Even if latencies are at first on
the order of one millisecond, it is still likely to be better to
"page" to another machine's RAM than to disk. It is already useful.
(Consider paging to another machine over 100 Mbit ethernet: throughput
is the same as a reasonable disk, but "seek times" are much better...)
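
The break-even is easy to write out. A sketch using the numbers from
the paragraph above (note that a 20% faster CPU saves about a sixth of
the remaining runtime, not a fifth):

  #include <stdio.h>

  int main(void)
  {
      double vm_bytes = 64e6;   /* task's VM image */
      double net_bw   = 80e6;   /* bytes/sec over the network */
      double faster   = 1.20;   /* target CPU is 20% faster */

      double transfer_s = vm_bytes / net_bw;    /* ~0.8 s to move the task */
      double saved_frac = 1.0 - 1.0 / faster;   /* ~0.17 of remaining runtime */

      /* migration pays off once the time saved on the remaining
         runtime exceeds the transfer cost */
      printf("transfer %.1f s, pays off after %.0f s of remaining runtime\n",
             transfer_s, transfer_s / saved_frac);
      return 0;
  }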

In my master's thesis, I realized that there are two types of
applications that can be parallelized: those whose data-dependency
graph has loops, and those whose graph doesn't.

Consider FFT. No loops. Here latency is not an issue. Bandwidth
is. Also the "100 users are doing independent things" case falls under
this category.

Most applications, however, DO have loops in their data dependencies.
In that case the latencies are the bottleneck. Consider for example a
finite-elements program: for the current timestep, you need the
results of the previous timestep from the adjacent elements.
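
In kernel-of-the-loop form, the dependency looks like this (a toy 1-D
sketch, nothing more). Unlike the FFT case, splitting the elements
over nodes means exchanging boundary values every timestep, and that
round trip is pure latency:

  #include <stddef.h>

  /* toy 1-D finite-elements update: element i at timestep t+1 needs
     its neighbours' values from timestep t, so prev[i-1] or prev[i+1]
     may have to come from another node each step */
  void fem_step(double *next, const double *prev, size_t n)
  {
      for (size_t i = 1; i + 1 < n; i++)
          next[i] = 0.5 * prev[i]
                  + 0.25 * (prev[i - 1] + prev[i + 1]);
  }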

If Linux moves towards allowing clusters of machines to function as
one when they are networked together, the push for faster and
lower-latency network devices will get more and more significant.

The software layer should "know" about the latency and bandwidth
restrictions in the form of tunable parameters. A task that has run
for an hour and does on average one kbyte of IO per second of CPU time
(e.g. a raytracing program) is worth migrating to a different (faster)
host. On any network.

An application doing 1 MB of IO per second might be worth migrating
towards the data. On the other hand, on gigabit ethernet or 80 MB/sec
SCSI networks that is still insignificant enough to allow this
application to migrate towards a faster host, if the running time
seems to warrant that.
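
A crude sketch of such a rule, with the knobs as explicit tunables.
The names, the cutoff, and the use of past CPU time as a stand-in for
remaining runtime are all invented for illustration:

  /* tunables describing the cluster, not gospel */
  struct cluster_tunables {
      double net_bandwidth;   /* bytes/sec to the candidate host */
      double io_rate_cutoff;  /* bytes of IO per CPU-second above which
                                 the task should move towards its data
                                 instead of towards a faster CPU */
  };

  /* returns 1 if the task looks worth moving to a faster host */
  int worth_migrating(double cpu_seconds, double io_bytes,
                      double vm_bytes, double speed_gain,
                      const struct cluster_tunables *t)
  {
      double io_rate  = io_bytes / cpu_seconds;
      double transfer = vm_bytes / t->net_bandwidth;

      if (io_rate > t->io_rate_cutoff)
          return 0;  /* IO-bound: migrate towards the data instead */

      /* same break-even rule as before: the expected saving on the
         remaining runtime (assumed comparable to what has already
         been used) must beat the transfer cost */
      return cpu_seconds * (1.0 - 1.0 / (1.0 + speed_gain)) > transfer;
  }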

Most of this is long-term future. I don't know whether we can get
Linux to the point where the clustering features become important
enough for the industry to start thinking about our "needs".

The current state of the art proves Richard and Larry right: for best
performance the API should match the hardware. (Don't provide a DSM
interface on message-passing hardware, or the other way around!)

But looking towards the future, I don't think we should dismiss
thinking about clusters of machines and how we can make them perform.
Even if the interfaces are not optimal right now.

Regards,

Roger.

-- 
** R.E.Wolff@BitWizard.nl ** http://www.BitWizard.nl/ ** +31-15-2137555 **
*-- BitWizard writes Linux device drivers for any device you may have! --*
*   Never blow in a cat's ear because if you do, usually after three or  *
*   four times, they will bite your lips!  And they don't let go for at  *
*   least a minute. -- Lisa Coburn, age 9
