Re: Interesting analysis of linux kernel threading by IBM

From: Larry McVoy (lm@bitmover.com)
Date: Fri Jan 21 2000 - 21:09:32 EST


Davide:
: On Fri, 21 Jan 2000, Larry McVoy wrote:
: > Rendering is what we call an embarrassingly parallel application.
: > In other words, very, very coarse grained parallelism works great for
: > this, in fact, it works orders of magnitude better than what you described.
: > Talk to Disney, Pixar, ILM, RFX - all of whom are heavily into this space,
: > all of whom I've visited personally to talk about their computing needs,
: > and all of whom use farms of uniprocessors for rendering. There are
: > a bunch of other ones too, Digital Design, Pacific something (used to be
: > in Walnut Creek, now in Palo Alto), etc. All the production and post
: > production digital houses know that farms of machines that share nothing
: > but a network are the highest performance and least cost way to do
: > rendering.
:
: Are you saying that N processes running on N uniprocessor systems,
: exchanging data through the network, perform better than a single N-way SMP
: system exchanging data in memory, due to cache effects (given the same
: software architecture)?

Yes, that's exactly correct. It should be obvious if the cost of the
network is zero, right? Maybe not. Let's be explicit about that: N
processors on N non-shared memory subsystems will always have better
performance than N processors on one shared memory subsystem if the
performance of the memory subsystem is held constant and the N processors
are not running 100% out of the cache.

That isn't quite true: if the shared case has just enough cache misses
to use up the shared memory subsystem, but no more, then the performance
will be identical. But I think you can see the point: if there is no
sharing going on, a shared resource is just a bottleneck.

This generalizes. If you work out the cost of communication vs. the
cost of computation (sometimes called the CPU:I/O ratio), then workloads
where CPU / I/O is large are called embarrassingly parallel. The larger
the number, the more you can scale without any help from SMP at all.
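
To see the shape of it, here's a toy model (the numbers are made up,
they are just there to show the curve): each unit of work costs some
compute time plus a fixed communication cost to ship the result over
the network.

    #include <stdio.h>

    int
    main(void)
    {
        double comm_ns = 1000.0;    /* fixed cost to ship one result */
        double compute_ns;

        for (compute_ns = 1e3; compute_ns <= 1e9; compute_ns *= 1000.0) {
            double ratio = compute_ns / comm_ns;    /* CPU:I/O */
            double eff = compute_ns / (compute_ns + comm_ns);

            printf("CPU:I/O %9.0f  efficiency %.6f\n", ratio, eff);
        }
        return 0;
    }

Once the ratio gets big, the efficiency is essentially 1 and the
network cost disappears in the noise; that's what embarrassingly
parallel means.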

Where the number is smaller, we are getting into fine grained parallelism.
Important examples include computational simulations where the data is
changing a lot and, to move from step one to step two, you have to exchange
boundary values with your neighbor.
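
To make that concrete, here's about the simplest example I can write
down (a toy 1-D heat equation step, serial): cell i at step t+1 needs
cells i-1 and i+1 from step t, so a version that splits the array
across nodes has to swap its edge cells with both neighbors on every
single step. Tiny compute, communication every step; the opposite of
embarrassingly parallel.

    #include <stdio.h>

    #define N 16

    int
    main(void)
    {
        double cur[N] = { 0 }, next[N];
        int i, step;

        cur[0] = 100.0;        /* boundary condition: hot left edge */

        for (step = 0; step < 100; step++) {
            /*
             * In the domain-decomposed version, this is where each
             * node sends its edge cells to its neighbors and waits
             * for theirs, one communication round per tiny step.
             */
            for (i = 1; i < N - 1; i++)
                next[i] = .25 * cur[i-1] + .5 * cur[i] + .25 * cur[i+1];
            next[0] = cur[0];
            next[N-1] = cur[N-1];
            for (i = 0; i < N; i++)
                cur[i] = next[i];
        }
        for (i = 0; i < N; i++)
            printf("%6.2f ", cur[i]);
        printf("\n");
        return 0;
    }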

In both cases, it's very rare that having more threads than CPUs ever
helps you. As Alan said, threads are for people who can't program
state machines. He's dead on correct, and if you think about it, threads
have inherently more cache footprint than a state driven model (this
actually isn't true in theory in all cases, but man, oh, man is it ever
true in practice; I don't know what it is, but people who use threads
have this built-in assumption that memory bandwidth is infinite, cache
size is infinite, etc.).
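
For the record, the state driven model looks something like this (a
skeleton, not a working server; the accept()/read()/write() wiring is
elided). The thing to notice is the footprint: each connection is a
few words of state in a table, not a whole stack plus a thread control
block.

    #include <sys/select.h>

    enum conn_state { CONN_IDLE, CONN_READING, CONN_WRITING };

    struct conn {
        int fd;
        enum conn_state state;
        int off, len;            /* progress through the buffer */
        char buf[512];
    };

    static struct conn conns[256];
    static int nconn;

    static void
    step(struct conn *c)
    {
        switch (c->state) {
        case CONN_READING:
            /* read() what's there, advance off, maybe -> WRITING */
            break;
        case CONN_WRITING:
            /* write() what's pending, maybe -> READING */
            break;
        default:
            break;
        }
    }

    int
    main(void)
    {
        fd_set rd;
        int i;

        for (;;) {
            FD_ZERO(&rd);
            for (i = 0; i < nconn; i++)
                FD_SET(conns[i].fd, &rd);
            if (select(FD_SETSIZE, &rd, 0, 0, 0) <= 0)
                continue;
            for (i = 0; i < nconn; i++)
                if (FD_ISSET(conns[i].fd, &rd))
                    step(&conns[i]);
        }
    }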

: > I am starting to wonder if you've ever coded up an application both ways
: > and tested it.
:
: The rendering pipeline (as the keyword states) is a highly parallel
: environment in which a subsystem takes one type of data, transforms it into
: a new kind of data, and passes the result to the next subsystem.

This is exactly the worst thing you could do. Try it. Forget rendering,
just take N processes in a pipeline and have them touch data and pass
it to their neighbor, and pin each process to a different processor.
Unless the computation costs dramatically outweigh the costs of cache
writebacks and fills, you'll get worse performance than just running
all the processes on one processor (which really means one cache).
You can actually build a system like this where you fine tune the amount
of work you do before you pass the data back and forth. Look at lat_ctx
in lmbench; it does exactly this (well, close: it single steps, but you
could make one that runs in parallel).
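
If you want to try it yourself, a quick and dirty version looks
something like this (a sketch: it single steps the same way lat_ctx
does, and the processor pinning is left out because there is no
portable way to say it). Time it with all the stages sharing one
processor and then with them spread across processors; unless you add
real work per stage, one cache wins.

    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    #define NPROC   4
    #define BUFSZ   4096        /* one pipe buffer's worth */
    #define ROUNDS  10000

    /* read/write exactly n bytes, looping over short transfers */
    static void
    readn(int fd, char *p, int n)
    {
        int r;

        while (n > 0) {
            if ((r = read(fd, p, n)) <= 0)
                exit(1);
            p += r;
            n -= r;
        }
    }

    static void
    writen(int fd, char *p, int n)
    {
        int r;

        while (n > 0) {
            if ((r = write(fd, p, n)) <= 0)
                exit(1);
            p += r;
            n -= r;
        }
    }

    int
    main(void)
    {
        int fds[NPROC + 1][2];
        char buf[BUFSZ];
        volatile char sink = 0;
        int i, j, r;

        for (i = 0; i <= NPROC; i++)
            if (pipe(fds[i]))
                exit(1);

        for (i = 0; i < NPROC; i++) {
            if (fork() == 0) {
                /*
                 * Stage i: read the buffer, touch every cache
                 * line (64 byte lines assumed), pass it on.
                 */
                for (r = 0; r < ROUNDS; r++) {
                    readn(fds[i][0], buf, BUFSZ);
                    for (j = 0; j < BUFSZ; j += 64)
                        sink += buf[j];
                    writen(fds[i + 1][1], buf, BUFSZ);
                }
                exit(0);
            }
        }

        /* the parent feeds one buffer in and waits for it to
         * come out the far end, ROUNDS times; time this part */
        for (r = 0; r < ROUNDS; r++) {
            writen(fds[0][1], buf, BUFSZ);
            readn(fds[NPROC][0], buf, BUFSZ);
        }
        return 0;
    }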

If you look at the average clock cycle these days, it's about 2 nanosecs.
A single cache miss is around 120ns on a UP and goes up to around 200-400ns
on SMPs. So if you take a cache miss every 60 cycles, you are running at
50% of what the CPU can do.
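
The arithmetic, for anyone who wants to fiddle with the numbers (these
are the rough figures above; your machine will differ):

    #include <stdio.h>

    int
    main(void)
    {
        double cycle_ns = 2.0;      /* ~500MHz clock */
        double miss_ns = 120.0;     /* UP cache miss latency */
        double work = 60.0;         /* cycles of work between misses */
        double stall = miss_ns / cycle_ns;      /* 60 cycles lost */

        printf("utilization = %.0f%%\n", 100.0 * work / (work + stall));
        return 0;
    }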

Your pipeline has built-in cache misses. You are passing the data from
cache to cache. It's a really bad idea unless the computation heavily
outweighs the data movement (and for rendering it absolutely does not for
any rendering engine I've ever seen - anyone who has a counter example
please tell us about it).

: If even a highly parallel job like a renderer cannot be coded well on SMP,
: what do we keep it for?

If your question is why SMPs are a good idea, I think the answer is that
they provide a level of dynamic load balancing that you can't get out of
a bunch of uniprocessors connected through a network; they also provide
much higher bandwidth between nodes, and they frequently have more memory.

: OK, probably the solution you push is clusters of SMPs.

I don't push anything. I avoid things where I've tried it and watched
other people try it and I can sit down and write out a proof that says it
still won't work. I do try to explain this stuff to people who haven't
tried it yet, but that is almost always a failed attempt on my part.
People really want to bang their head against the wall for themselves
and convince themselves that it will or won't work. Which is fine with me;
I strongly encourage you to write little test programs which emulate your
application and see if what I'm saying makes sense. I'll even help you;
I have lots of code set up to do this sort of thing in lmbench.

What I'm less than thrilled about is people jumping in and saying "the
scheduler is busted because it is slow when I do XYZ" and not listening
when people who have been doing XYZ for years say that the way you are
doing XYZ is not going to work. You yourself admit that you've never tried
this stuff on anything but a 2 processor system. I've tried stuff like
this on everything from uniprocessors up to 2,048 processor systems.
What's more, I've spent a lot of time working with the hardware
designers trying to figure out ways to support the programming model
that you want, and the conclusion that we've all come to is that, yes,
we can give you the ILLUSION that you have that model and as long as you
don't really use it, it works. But if you really think you have a big
SMP doing fine grained parallel stuff that misses the cache all the time,
you're flat out crazy and don't understand how hardware works and what
the limits are.

I'm really not excited about you pushing changes into the scheduler
to support a model that we already know at the hardware level is
unsupportable. Especially when those changes introduce so much as one
cache miss, one more cache line, or one more cycle in the code path that
we all use everyday all day long. No thanks.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.rutgers.edu
Please read the FAQ at http://www.tux.org/lkml/



This archive was generated by hypermail 2b29 : Sun Jan 23 2000 - 21:00:27 EST