Re: Interesting analysis of linux kernel threading by IBM

From: Mark Montague (monty@gg.caltech.edu)
Date: Sat Jan 22 2000 - 19:50:16 EST


Larry McVoy <lm@bitmover.com> writes:

> : why do you all speak about badly designed multithreaded apps.
> : Every badly applied technology is worse!
> : Suppose you have a rendering job to be done.
> : It can be subdivided into a highly parallel system with distinct threads
> : that can run together:
> :
> : 1) Viewing transformation
> : 2) Triangulation
> : 3) Scan conversion
> : 4) Texturing
> : 5) Illumination
> : 6) Frame output
> :
> : just to keep it simple.
> : Now can someone tell me why I would not split my job into threads?
>
> Rendering is what we call an embarrassingly parallel application.
> In other words, very, very coarse grained parallelism works great for
> this, in fact, it works orders of magnitude better than what you described.
> Talk to Disney, Pixar, ILM, RFX - all of whom are heavily into this space,
> all of whom I've visited personally to talk about their computing needs,
> and all of whom use farms of uniprocessors for rendering. There are
> a bunch of other ones too, Digital Design, Pacific something (used to be
> Walnut Creek now are in Palo Alto), etc. All the production and post
> production digital houses know that farms of machines that share nothing
> but a network are the highest performance and least cost way to do
> rendering.

I suspect you mean Digital Domain and Pacific Data Images...

> If you suggested a multithreaded application to do that to any of those
> guys in a job interview, and stuck to your opinion that it was a good
> idea, my prediction is that you would be standing on the street wondering
> what happened in less than 5 minutes. Those people are doing hard work
> on short schedules and really don't have time to waste.

More importantly, rendering is almost always separable into completely
different processes, so when some of these companies do use SMP SGIs
and the like, the usual approach is to run N separate renderers
working on different frames. The only context in which SMP helps batch
graphics is when you can share a read-only copy of the geometric model
of your data in core; since it's read-only, no one's cache should get
invalidated, and if, for example, 99% of your scene is objects that
don't move (in 3d) between frames, you can share that dataset.
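
If it helps, here's roughly what that looks like in C on linux -- an
untested sketch, where render_frame(), the sizes, and the frame count
are all made up: map the scene shared, load it, then mprotect() it
read-only before forking the workers.

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <sys/wait.h>
    #include <unistd.h>

    #define NWORKERS 4
    #define NFRAMES  32

    /* stand-in for the real renderer: the point is it only READS scene */
    static void render_frame(const unsigned char *scene, int frame)
    {
        printf("pid %d: frame %d (scene[0]=%u)\n",
               (int)getpid(), frame, scene[0]);
    }

    int main(void)
    {
        size_t scene_size = 4096;       /* stand-in for the model size */

        /* one shared mapping, writable only while we load the model */
        unsigned char *scene = mmap(NULL, scene_size,
                                    PROT_READ | PROT_WRITE,
                                    MAP_SHARED | MAP_ANONYMOUS, -1, 0);
        if (scene == MAP_FAILED) { perror("mmap"); exit(1); }
        memset(scene, 42, scene_size);  /* "load" the geometry */

        /* freeze it: later writes segv instead of silently scribbling */
        mprotect(scene, scene_size, PROT_READ);

        for (int i = 0; i < NWORKERS; i++)
            if (fork() == 0) {          /* worker: every NWORKERS-th frame */
                for (int f = i; f < NFRAMES; f += NWORKERS)
                    render_frame(scene, f);
                _exit(0);
            }
        while (wait(NULL) > 0)
            ;
        return 0;
    }

Since nobody writes those pages after the mprotect(), the caches never
fight over them.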

Similarly, for ray tracing one frame, you can have up to one "thread"
per pixel if you want, but since the pixels never talk to each other,
you can share a read-only copy of the geometry, and the rest are
independent. You don't need to share VM, except for mapping read-only
stuff, deciding who's tracing which pixels, and shoving the finished
pixels out to the frame buffer/disk/film recorder/videotape. The
AT&T Pixel Machine, circa 1987(?) (see the SIGGRAPH paper on it), took
this approach by splitting up the task among a number of processors,
and having a hardware "pixel-funnel" to collect the results. I think
they had a whole copy of the geometry on each processor, though, while
other approaches share a read-only copy. If your dataset is too big to
fit in core, I believe that the right answer is to use coherence by
choosing which pixels to assign to which CPU cleverly, but the rest of
the issues still apply.

Almost all the unix-cluster high-end renderers I've dealt with split
things up so each CPU is working on either a segment of the image (for
single images) or a single frame (for animations).

The only exceptions to this I can think of for batch rendering of
single frames are radiosity and the more high-end rendering-equation
stochastic solvers (e.g. Monte Carlo stuff), which generally use
matrix/FEM code that has been worked out by the applied-math weenies.

For real-time rendering, however, Larry's arguments are a bit
incorrect; there are stages of the pipeline that are well-suited to
parallelization, which is why "traditional" display-list graphics
hardware uses a pipelined style. However, since the flow is only in
one direction, it's very well-suited to a vector processing style of
parallelism, and using shared-VM threads is relatively inefficient
because it pays for communication machinery the pipeline never uses. The unix

foo | bar | baz

pipeline model isn't all that bad for rendering, in fact, although the
implementation of this on SMP machines is generally sub-optimal for
the cache-invalidation problem brought up earlier. Smart shared-memory
buffering could probably fix this effectively, though.
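
For concreteness, here's roughly what the hand-rolled version of
foo | bar | baz looks like with pipe()/fork() -- untested sketch, and
the stage body is a no-op pass-through where the real transform /
scan-convert / texture work would go:

    #include <sys/wait.h>
    #include <unistd.h>

    /* placeholder stage: real code would rewrite the records en route */
    static void stage(int in, int out)
    {
        char buf[4096];
        ssize_t n;
        while ((n = read(in, buf, sizeof buf)) > 0)
            write(out, buf, (size_t)n);
    }

    int main(void)
    {
        int p1[2], p2[2];
        pipe(p1);
        pipe(p2);

        if (fork() == 0) {                       /* "foo": stdin -> p1 */
            close(p1[0]); close(p2[0]); close(p2[1]);
            stage(STDIN_FILENO, p1[1]);
            _exit(0);
        }
        if (fork() == 0) {                       /* "bar": p1 -> p2 */
            close(p1[1]); close(p2[0]);
            stage(p1[0], p2[1]);
            _exit(0);
        }
        if (fork() == 0) {                       /* "baz": p2 -> stdout */
            close(p1[0]); close(p1[1]); close(p2[1]);
            stage(p2[0], STDOUT_FILENO);
            _exit(0);
        }
        /* parent: close everything or the readers never see EOF */
        close(p1[0]); close(p1[1]); close(p2[0]); close(p2[1]);
        while (wait(NULL) > 0)
            ;
        return 0;
    }

Each stage gets its own address space and its own cache; the only
shared thing is the pipe buffer, which is exactly where the smart
buffering would go.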

Also, for this example, if you really care about speed in a
traditional, display-list renderer, you should get the common and
cheap custom hardware that has been severely optimized for this, and
use someone's well-tested code (e.g. OpenGL).

If you want to get into real-time optimizations, though, you *really*
want to throw out the traditional pipeline model anyway, and cheat a
lot, as in the ill-fated Talisman design, UNC's "Frameless Rendering",
QTVR, and lots of image-warping, compositing, and
pre-calculation/caching/integrating of data to avoid re-calculating
things every frame/picture/polygon/instance. Remember that the
graphics pipeline was designed when it was impractical to store a lot
in memory, so re-computing this stuff was necessary.

> I am starting to wonder if you've ever coded up an application both ways
> and tested it. If you had tried the rendering model that you suggested
> and then tried the same thing all in one process, I believe that your
> way would show dramatically lower performance. It's been shown that
> while the model of fine grained parallelism, especially in data parallel
> applications like what you are talking about, can be
> supported, the cache effects of doing so on an SMP dramatically _REDUCE_
> the performance. It's always been seen that you are better off to divide
> up the data, do all the different transformations to a chunk of data by
> one process on one processor in one cache, rather than by spreading the
> same data over a bunch of caches. In fact, all the research in parallel
> applications boils down to ``how much can you divide up the data''.
> If there is so much focus on that, all of it performance related, why
> is it that you believe something that certainly seems to fly in the face
> of both theory and practice?

I think the "how much can you divide up the data" framing is a bit
silly for graphics; just share the stuff, since it's generally read-only,
except for radiosity-type things. Economically, it's been more-or-less proven
that the best bet for graphics is workstation farms, where you treat
N-processor SMP boxes as N separate workstations.

I'm also a big fan of smart caches which can get garbage-collected if
memory is tight, and recomputed if needed, but which otherwise
persist, so that if two adjacent pixels hit the same polygon, there's
a good chance it'll be cached, along with the right page in its
texture map, and maybe some of the intermediate values for its Phong
shading, or whatever. Sometimes you can coax the VM system into
doing this by rendering 16x16 tiles of pixels together
rather than marching along a scanline, but a lot of OOP, functional,
and recursive techniques encourage recalculation of this stuff.
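
The tile version is trivial to write, by the way -- untested sketch,
with shade() standing in for the real per-pixel work and the sizes
pulled out of thin air:

    #include <stdio.h>

    #define WIDTH  640
    #define HEIGHT 480
    #define TILE   16

    static unsigned image[HEIGHT][WIDTH];

    /* stand-in for the real work; neighbours in a tile tend to hit the
       same polygon, texture page, and shading intermediates */
    static unsigned shade(int x, int y) { return (unsigned)(x ^ y); }

    int main(void)
    {
        /* finish each 16x16 tile before moving on, instead of marching
           whole scanlines, so the per-object caches stay warm */
        for (int ty = 0; ty < HEIGHT; ty += TILE)
            for (int tx = 0; tx < WIDTH; tx += TILE)
                for (int y = ty; y < ty + TILE; y++)
                    for (int x = tx; x < tx + TILE; x++)
                        image[y][x] = shade(x, y);
        printf("image[0][0] = %u\n", image[0][0]);
        return 0;
    }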

> If you don't mind, can you tell me if you have actually tried any of
> these ideas you are advocating, tried them both ways and compared the
> performance? It's been a couple of years since I looked at this in
> depth and perhaps the technology or the transformations have changed
> their characteristics and I'm all wrong, this stuff works great the way
> you say it does. But I'd sure like to understand why that is, if that
> is the case, because it shakes up some pretty basic understanding that
> I have of how the apps and the SMPs work.

Personally, I think that the clone() system call is close to the right
answer; making wise flag choices when calling clone() is a much better
approach on linux than using threads (although threads are more
portable). The only thing I'd like to see that's not in the clone()
design is to have a way of initializing a memory segment and forcing
it to be read-only. With copy-on-write, though, this mostly works on
its own; it'd just be nice to get a segv instead of scribbling all
over your data. Maybe this can be done by clever use of mmap or sysv
IPC shared memory, though; I just suspect they may have overhead
because they are overkill for this use. Another nice option would be
the ability to mmap (or similar) your parent's VM space as read-only,
for implementing producer/consumer or one-way pipelining type things
without needing the kernel to fiddle it.
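
To make the flag business concrete, here's the shape of it on linux
(an untested sketch against the glibc clone() wrapper; the flag set is
just one example) -- share the file table but *not* the address space:

    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/wait.h>
    #include <unistd.h>

    static int worker(void *arg)
    {
        /* no CLONE_VM below, so this is a copy-on-write copy of the
           parent's counter -- readable, but we can't scribble on the
           original */
        printf("child sees counter = %d\n", *(int *)arg);
        return 0;
    }

    int main(void)
    {
        const size_t stack_size = 64 * 1024;
        char *stack = malloc(stack_size);
        int counter = 7;

        /* CLONE_FILES shares the descriptor table; omitting CLONE_VM
           keeps the memory protection of fork() */
        pid_t pid = clone(worker, stack + stack_size, /* stack grows down */
                          CLONE_FILES | SIGCHLD, &counter);
        if (pid < 0) { perror("clone"); exit(1); }
        waitpid(pid, NULL, 0);
        free(stack);
        return 0;
    }

Swap flags in or out per resource; that's the knob fork() and threads
both lack.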

As long as I'm waxing opinionated, I think "threads" as they are
commonly used lately are a really bad idea, and should be avoided
almost all the time. Sharing VM with another process not only raises
cache-invalidation problems, it also grossly violates the virtual memory
and timesharing models of computing, which have been shown to be very
effective at producing robust code. Multithreading throws all of that
away, since several processes can now have race conditions and
overwrite problems, and the burden is placed on the programmer to not
make any mistakes in the use of *all* locations in the threads' memory
map. When you fork(), you're *guaranteed* not to be able to scribble
on your parent's data, and the OS has the job of coordinating
parent-child communications, whether it's by pipes, files, shmem,
semaphores, or signals. The number of cases where this is not
fine-grained enough for good performance is *much* smaller than the
number of cases where threads are (ab)used.

The thing that I *really* like about clone() is that it gives some
control over which resources are shared between parent and child. I
think it could offer a bit more control than it does, but if you need
something that fork() doesn't provide, then the "tradition" is that
you have to use a thread, and share everything. Asynchronous I/O is
also brought up a lot, but it seems like it's handled by clone(),
although it might be desirable to have read-only pages and some
page-synchronization cleverness (e.g. wait until a whole VM page is
buffered before forking an async write, so that if the parent is still
pushing stuff into the pipe, it touches a separate page and doesn't
invalidate the child's cache).
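
A poor man's version of that already works with plain fork() and
page-sized buffers -- untested sketch, with the file name and chunk
count made up:

    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/wait.h>
    #include <unistd.h>

    int main(void)
    {
        long page = sysconf(_SC_PAGESIZE);
        unsigned char *buf = malloc((size_t)page);
        int fd = open("frames.out", O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (!buf || fd < 0) { perror("setup"); exit(1); }

        for (int chunk = 0; chunk < 8; chunk++) {
            memset(buf, chunk, (size_t)page);  /* fill a whole page */
            if (fork() == 0) {
                /* the child's copy-on-write snapshot of buf is stable
                   even while the parent refills it; the explicit offset
                   keeps concurrent writers from interleaving */
                pwrite(fd, buf, (size_t)page, (off_t)chunk * page);
                _exit(0);
            }
            /* parent goes straight back to producing the next page */
        }
        while (wait(NULL) > 0)
            ;
        close(fd);
        free(buf);
        return 0;
    }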

                        blah blah rant blah

                                - M

p.s. I like Larry's SMP scheme, if that wasn't obvious from my
ranting. I just have thought/read/talked about rendering a *lot*.

-- 
Mark "Monty" Montague | monty@gg.caltech.edu  | I don't do Windows(tm)
       Give me a perl pocket ref, and I'll give you the world.
If guns are outlawed, we'll have to kill each other with cryptography.
	     I'd give C++ about a D-... **** Y2K is 2048
