Re: Help in DSM design

From: Jamie Lokier (lk@tantalophile.demon.co.uk)
Date: Sun Mar 05 2000 - 07:07:44 EST


Larry McVoy wrote:
> OK, so that leaves us with programs that are actually sharing memory
> because they need to - they are communicating with each other through the
> memory rather than communicating through messages. Classic examples
> of this are your basic simulation, where each process (thread if you
> must) is working on some subset of the space. Let's say it is a three
> dimensional simulation, so each process has a subcube of the larger
> cube.

Thank you. You've described the perfect application for DSM :-)
See below.

> This is a nasty problem space because each process' forward progress
> depends not only on the results of its local computation but also on
> the results of the computation of its neighbors, since if the cube right
> next to you is blowing up, that's going to have some effect on your
> cube.

Not true for every iteration. You have N*N*N-cell subcubes, one subcube
per node, and the relations are local: if an effect propagates T cells
per iteration, it takes N/T iterations for input on one side of a
subcube to affect output on the other side.

Each subcube transfers its current state, to a surface depth of D
(where D < N), to all its neighbours. Every neighbour can then compute
for D/T iterations before requiring the next state update.

You still need neighbour updates, but you've just decoupled the
communication "batch" rate from the iteration rate.

Choose N and D such that computation dominates over communication
latency per *batch*, and you're doing well.
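
To make that concrete, here's a minimal sketch of the batched loop,
assuming one subcube per node with a D-deep ghost region;
exchange_halo() and step() are made-up stand-ins for whatever the DSM
or message layer provides:

    #define N 64   /* cells along one edge of this node's subcube */
    #define D 8    /* ghost (halo) depth, D < N                   */
    #define T 1    /* cells an effect propagates per iteration    */

    /* interior cells plus a D-deep halo of neighbour state */
    static double cube[N + 2*D][N + 2*D][N + 2*D];

    /* hypothetical: pull the D-deep surface of each neighbour into
       our halo -- one batch of communication */
    static void exchange_halo(void) { /* ... */ }

    /* hypothetical: one local timestep over the cells whose inputs
       are still valid */
    static void step(void) { /* ... */ }

    void run(int iterations)
    {
        for (int i = 0; i < iterations; i += D / T) {
            exchange_halo();                /* pay the latency once */
            for (int k = 0; k < D / T; k++) /* then compute a batch */
                step();
        }
    }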

A naive DSM won't be able to batch like that, but one which spots the
regular access patterns of a regular simulation will. (Use an
algorithm similar to the one used to predict disk access patterns.)
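
For instance, a readahead-style stride detector on the remote-fault
path. A toy sketch only -- every name here (dsm_fault_hint,
dsm_prefetch_page, the thresholds) is invented:

    struct dsm_readahead {
        unsigned long last_page; /* page index of the previous fault */
        long stride;             /* candidate stride, 0 = no pattern */
        int hits;                /* consecutive faults matching it   */
    };

    #define RA_THRESHOLD 3  /* faults needed before trusting a stride */
    #define RA_DEPTH     4  /* pages to prefetch once we do           */

    /* hypothetical hook: issue an asynchronous pull of a remote page */
    static void dsm_prefetch_page(unsigned long page) { (void)page; }

    /* call on every remote-page fault */
    void dsm_fault_hint(struct dsm_readahead *ra, unsigned long page)
    {
        long stride = (long)page - (long)ra->last_page;

        if (stride != 0 && stride == ra->stride) {
            if (++ra->hits >= RA_THRESHOLD)
                for (int i = 1; i <= RA_DEPTH; i++)
                    dsm_prefetch_page(page + i * stride);
        } else {
            ra->stride = stride; /* start tracking a new pattern */
            ra->hits = 1;
        }
        ra->last_page = page;
    }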

> So contrast this model with the "CPU from executing another thread
> while it waits" statement. The only way you can keep all the CPUs busy all
> the time is by adding other computations - it's obvious that this
> computation will go much much slower under a DSM, isn't it?

No, it isn't. Once you've got the computation of a single iteration
dominating over the communication latency per iteration, you only need
to run two threads on one node to absorb the latency altogether.

For a 3d simulation that's simple: use smaller cubes and run more than
one per node.
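
A sketch of that overlap, assuming split-phase communication;
start_exchange()/wait_exchange()/compute_batch() are invented
stand-ins for non-blocking exchange primitives:

    struct subcube { int id; /* per-subcube simulation state */ };

    static void start_exchange(struct subcube *c) { (void)c; /* post async halo send/recv */ }
    static void wait_exchange(struct subcube *c)  { (void)c; /* block until it completes  */ }
    static void compute_batch(struct subcube *c)  { (void)c; /* D/T local iterations      */ }

    /* run two subcubes on one node: while one subcube's exchange is
       in flight, compute on the other, and vice versa */
    void run_pair(struct subcube *a, struct subcube *b, int batches)
    {
        start_exchange(a);
        for (int i = 0; i < batches; i++) {
            start_exchange(b);
            wait_exchange(a);
            compute_batch(a);   /* overlaps with b's exchange */
            start_exchange(a);
            wait_exchange(b);
            compute_batch(b);   /* overlaps with a's exchange */
        }
        wait_exchange(a);
    }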

> If it's not obvious, you haven't thought about it enough, it's definitely
> obvious to anyone who's worked in this space. So you could run multiple
> simulations to cover up the latency. But that's wacko too - the latencies
> are so high that you'd need 100s or 1000s of other tasks to cover up
> the latencies.

You're assuming the computation is small compared with the
communication. Consider: suppose the communication per iteration takes
1s. Ew, massive latency.

Suppose the computation takes 100s. Communication doesn't look so
horrible now, does it?
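
(To put numbers on it: with those figures a single thread stalls for 1s
in every 101s, so it's already at about 99% efficiency with no overlap
at all. Run a second subcube per node as above and, assuming the
transfer can proceed while the other subcube computes, the stall is
hidden entirely.)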

The messy part is those pesky locks needed to keep the simulation
synchronised. I sense a "typedef struct { ... } semaphore_t
__attribute__ ((aligned (PAGE_SIZE)));" coming on :-)
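
That is, give every lock its own page, so two nodes contending for
different locks never ping-pong the same DSM page between them. A
minimal sketch (the PAGE_SIZE value and the semaphore body are
placeholders):

    #define PAGE_SIZE 4096

    typedef struct {
        volatile int count;
        /* pad to a full page so the next lock lands on a fresh one */
        char pad[PAGE_SIZE - sizeof(int)];
    } semaphore_t __attribute__ ((aligned (PAGE_SIZE)));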

Granted, it may be easier to use explicit message passing once you've
had to think about the messy details anyway...

-- Jamie



