Re: Interesting analysis of linux kernel threading by IBM

From: Davide Libenzi (dlibenzi@maticad.it)
Date: Fri Jan 21 2000 - 19:40:02 EST


On Fri, 21 Jan 2000, Larry McVoy wrote:
> Rendering is what we call an embarrassingly parallel application.
> In other words, very, very coarse grained parallelism works great for
> this, in fact, it works orders of magnitude better than what you described.
> Talk to Disney, Pixar, ILM, RFX - all of whom are heavily into this space,
> all of whom I've visited personally to talk about their computing needs,
> and all of whom use farms of uniprocessors for rendering. There are
> a bunch of other ones too, Digital Design, Pacific something (used to be
> Walnut Creek now are in Palo Alto), etc. All the production and post
> production digital houses know that farms of machines that share nothing
> but a network are the highest performance and least cost way to do
> rendering.

Are you saying that N processes running on N uniprocessor systems,
exchanging data through the network, perform better than a single N-way
SMP system exchanging data in memory, because of cache effects (given
the same software architecture)?

> If you suggested a multithreaded application to do that to any of those
> guys in a job interview, and stuck to your opinion that it was a good
> idea, my prediction is that you would be standing on the street wondering
> what happened in less than 5 minutes. Those people are doing hard work
> on short schedules and really don't have time to waste.

I haven't had the luck you've had to meet such interesting people, so I
can't guess what they would tell me.

> I am starting to wonder if you've ever coded up an application both ways
> and tested it. If you had tried the rendering model that you suggested
> and then tried the same thing all in one process, I believe that your
> way would show dramatically lower performance. It's been shown that
> while the model of fine grained parallelism, especially in data parallel
> applications like what you are talking about, can be supported, the
> cache effects of doing so on an SMP dramatically _REDUCE_ the
> performance. It's always been seen that you are better off to divide
> the performance. It's always been seen that you are better off to divide
> up the data, do all the different transformations to a chunk of data by
> one process on one processor in one cache, rather than by spreading the
> same data over a bunch of caches. In fact, all the research in parallel
> applications boils down to ``how much can you divide up the data''.
> If there is so much focus on that, all of it performance related, why
> is it that you believe something that certainly seems to fly in the face
> of both theory and practice?
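
If I understand the model you describe, it is roughly the sketch below:
one forked worker per chunk of frames, and every worker applies all the
passes to its own chunk, so the data never leaves its cache. The names
render_chunk(), transform(), shade(), write_frame() are hypothetical
place-holders, of course, not anybody's real code:

#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define NFRAMES   240   /* frames to render       */
#define NWORKERS  8     /* one worker per CPU/box */

/* hypothetical: run every pass on one chunk, in one address space */
static void render_chunk(int first, int last)
{
        int f;

        for (f = first; f < last; f++) {
                /* transform(f); shade(f); write_frame(f); ... */
        }
}

int main(void)
{
        int w, per = NFRAMES / NWORKERS;   /* remainder ignored in this sketch */

        for (w = 0; w < NWORKERS; w++) {
                pid_t pid = fork();

                if (pid == 0) {            /* child: work on its own chunk */
                        render_chunk(w * per, (w + 1) * per);
                        _exit(0);
                }
        }
        for (w = 0; w < NWORKERS; w++)
                wait(NULL);                /* parent: collect the workers */
        return 0;
}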

The rendering pipeline (as the term states) is a highly parallel
environment in which a subsystem takes one kind of data, transforms it
into a new kind of data, and passes the result to the next subsystem.
This is true for a scanline renderer (using shadow maps and environment
mapping), not for a raytracer. In such an environment I would expect
(you're right, I've only coded single-threaded renderers) that if I
decompose the pipeline into N stages and run it on an N-way SMP system,
I'll get good performance. Where good does not mean TotalTime / N, but
a time T with:

(TotalTime / N) < T << TotalTime

If even a highly parallel job like a renderer cannot be coded well for
SMP, what do we keep SMP for?
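
What I have in mind, very roughly, is the sketch below: one thread per
pipeline stage, each stage handing frames to the next one through a tiny
one-slot queue. The stage bodies -- transform(), shade(), write_frame()
-- are hypothetical place-holders:

#include <pthread.h>

#define NFRAMES  240
#define EOS      (-1)   /* end-of-stream marker */

/* one-slot hand-off between two adjacent stages */
struct slot {
        pthread_mutex_t lock;
        pthread_cond_t  cond;
        int             frame;
        int             full;
};

#define SLOT_INIT { PTHREAD_MUTEX_INITIALIZER, PTHREAD_COND_INITIALIZER, 0, 0 }

static struct slot geom_to_shade = SLOT_INIT;
static struct slot shade_to_out  = SLOT_INIT;

static void put(struct slot *s, int frame)
{
        pthread_mutex_lock(&s->lock);
        while (s->full)
                pthread_cond_wait(&s->cond, &s->lock);
        s->frame = frame;
        s->full = 1;
        pthread_cond_signal(&s->cond);
        pthread_mutex_unlock(&s->lock);
}

static int get(struct slot *s)
{
        int frame;

        pthread_mutex_lock(&s->lock);
        while (!s->full)
                pthread_cond_wait(&s->cond, &s->lock);
        frame = s->frame;
        s->full = 0;
        pthread_cond_signal(&s->cond);
        pthread_mutex_unlock(&s->lock);
        return frame;
}

static void *geometry_stage(void *arg)
{
        int f;

        for (f = 0; f < NFRAMES; f++) {
                /* transform(f);  -- hypothetical */
                put(&geom_to_shade, f);
        }
        put(&geom_to_shade, EOS);
        return NULL;
}

static void *shading_stage(void *arg)
{
        int f;

        while ((f = get(&geom_to_shade)) != EOS) {
                /* shade(f);  -- hypothetical */
                put(&shade_to_out, f);
        }
        put(&shade_to_out, EOS);
        return NULL;
}

static void *output_stage(void *arg)
{
        int f;

        while ((f = get(&shade_to_out)) != EOS) {
                /* write_frame(f);  -- hypothetical */
        }
        return NULL;
}

int main(void)
{
        pthread_t t[3];

        pthread_create(&t[0], NULL, geometry_stage, NULL);
        pthread_create(&t[1], NULL, shading_stage, NULL);
        pthread_create(&t[2], NULL, output_stage, NULL);

        pthread_join(t[0], NULL);
        pthread_join(t[1], NULL);
        pthread_join(t[2], NULL);
        return 0;
}

With N stages on an N-way box I would expect each stage to stay hot in
its own CPU's cache while the frames flow through.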

OK, probably the solution you're pushing is clusters of SMPs.

But, coming back to what I asked at the top of this message: given a
cluster of N computers, each an M-way SMP system, exchanging data over
Ethernet, have you measured that (cost apart) a single (M x N)-way SMP
system will perform (scale) worse than the cluster?
I can't believe that the cache effects are bigger than the Ethernet
bottleneck.
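
Just as a rough back-of-the-envelope comparison (approximate numbers, of
course):

        100 Mbit/s Ethernet :  100 / 8            ~=  12 MB/s (best case)
        PC100 memory bus    :  8 bytes * 100 MHz   =  800 MB/s peak

That is roughly two orders of magnitude; the cache effects would have to
be enormous to eat a difference like that.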

Unfortunately I have neither a Beowulf system nor a 32-way SMP system to
test my ideas on (only a poor 2-way).

Davide.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.rutgers.edu
Please read the FAQ at http://www.tux.org/lkml/


