Re: Scaling noise

From: Eric W. Biederman
Date: Sun Sep 07 2003 - 16:20:43 EST


Larry McVoy <lm@xxxxxxxxxxxx> writes:

> Here's a thought. Maybe the next kernel summit needs to have a CC cluster
> BOF or whatever. I'd be happy to show up, describe what it is that I see
> and have you all try and poke holes in it. If the net result was that you
> walked away with the same picture in your head that I have that would be
> cool. Heck, I'll sponser it and buy beer and food if you like.

Larry CC clusters are an idiotic development target.
The development target should be non coherent clusters.

1) NUMA machines are smaller, more expensive, and less available than
their non cache coherent counter parts.

2) If you can solve the communications problems for a non cache
coherent counter part the solution will also work on a NUMA
machine.

3) People on a NUMA machine can always punt and over share. On a non
cache coherent cluster when people punt they don't share. Not
sharing increases scalability and usually performance.

4) Small start up companies can do non-coherent clusters, and can
scale up. You have to be a substantial company to build a NUMA
machine.

5) NUMA machines are slow. There is not a single NUMA machine in the
top 10 of the top500 supercomputers list. Likely this has more to
do with system sizes supported by the manufacture than inherent
process inferiority, but it makes a difference.

SSI is good and it helps. But that is not the primary management
problem on a large system. The larger you get the imperfection of
your materials tends to be an increasingly dominate factor in
management problems.

For example I routinely reproduce cases where the BIOS does not work
around hardware bugs in a single boot that the motherboard vendors
cannot even reproduce.

Another example is Google who have given up entirely on machines
always working, and has built the software to be robust about error
detection and recovery.

And the SSI solutions are evolving. But the problems are hard.
How do you build a distributed filesystem that scales?
How do you do process migration across machines?
How do you checkpoint a distributed job?
How do you properly build a cluster job scheduler?
How do you handle simultaneous similar actions by a group of nodes?
How do you usefully predict, detect, and isolate hardware failures so
as not to cripple the cluster?
etc.

Eric
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/