Re: Interesting analysis of linux kernel threading by IBM

From: Davide Libenzi (dlibenzi@maticad.it)
Date: Sun Jan 23 2000 - 07:59:52 EST


On Sat, 22 Jan 2000, Larry McVoy wrote:
> : Are You saying that N processes running on N uniprocessor systems
> : exchanging data through a network perform better than a single N-way SMP
> : system exchanging data in memory, due to cache effects ( given the same
> : software architecture ) ?
>
> Yes, that's exactly correct. It should be obvious if the cost of the
> network is zero, right?
Sorry Larry, but that's not a small detail in this context.
It's like speaking about a beautiful girl and saying, "OK, ignore the fact that
she is a transsexual and has a d*ck" ;) ( I've nothing against trans )

> In both cases, it's very rare that having more threads than CPUs ever
> helps you.
I agree with You 100% here.
More threads running on the same CPU help nothing!

> : > I am starting to wonder if you've ever coded up an application both ways
> : > and tested it.
> :
> : The rendering pipeline ( as the keyword states ) is a highly parallel
> : environment in which a subsystem takes one type of data, transforms it into
> : a new kind of data, and passes the result to the next subsystem.
>
> This is exactly the worst thing you could do. Try it. Forget rendering,
> just take N processes in a pipeline and have them touch data and pass
> it to their neighbor, and pin each process to a different processor.
> Unless the computation costs dramatically outweigh the costs of cache
> writebacks and fills, you'll get worse performance than just running
> all the processes on one processor (which really means one cache).
> You can actually build a system like this where you fine tune the amount
> of work you do before you pass the data back and forth. Look at lat_ctx
> in lmbench, it does exactly this (well, close, it single steps, but you
> could make one that is in parallel).
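
Just to be sure we mean the same experiment, here is a rough user-space sketch
of it ( this is not lat_ctx itself, and NPROCS, BUFSIZE and PASSES are made-up
numbers; pinning each process to a different CPU is left out since it needs
platform-specific calls ) :

/*
 * Rough sketch of the pipeline experiment: N processes connected by
 * pipes in a ring, each one touching every byte of a token before
 * passing it to its neighbour.  The buffer is kept below PIPE_BUF so
 * each write moves the whole token at once.
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/time.h>
#include <sys/wait.h>

#define NPROCS  4
#define BUFSIZE 2048            /* data "touched" at every stage */
#define PASSES  10000

static void touch(char *buf)
{
        int i, sum = 0;

        for (i = 0; i < BUFSIZE; i++)   /* simulate work on the data:  */
                sum += buf[i]++;        /* dirties every cache line    */
        (void) sum;
}

int main(void)
{
        int pfd[NPROCS][2];
        char buf[BUFSIZE];
        struct timeval t0, t1;
        int i, p;

        for (i = 0; i < NPROCS; i++)
                pipe(pfd[i]);

        /* stages 1..N-1: read from the previous pipe, touch, pass on */
        for (i = 1; i < NPROCS; i++) {
                if (fork() == 0) {
                        for (p = 0; p < PASSES; p++) {
                                read(pfd[i - 1][0], buf, BUFSIZE);
                                touch(buf);
                                write(pfd[i][1], buf, BUFSIZE);
                        }
                        exit(0);
                }
        }

        /* stage 0 injects the token and waits for it to come around */
        memset(buf, 0, BUFSIZE);
        gettimeofday(&t0, NULL);
        for (p = 0; p < PASSES; p++) {
                touch(buf);
                write(pfd[0][1], buf, BUFSIZE);
                read(pfd[NPROCS - 1][0], buf, BUFSIZE);
        }
        gettimeofday(&t1, NULL);

        printf("%.3f usec per pass through the pipeline\n",
               ((t1.tv_sec - t0.tv_sec) * 1e6 +
                (t1.tv_usec - t0.tv_usec)) / PASSES);

        while (wait(NULL) > 0)
                ;
        return 0;
}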

I've probably said this before, but the renderer I'm speaking about
is a scanline renderer ( which uses shadow maps for shadows and environment
mapping for reflections ).
Now I'm speaking about renderers, but there are also other apps that can be
coded _well_ as parallel processing ( I speak about renderers because it's my work ).

> Your pipeline has builtin cache misses. You are passing the data from
> cache to cache. It's a really bad idea unless the computation heavily
> outweighs the data movement (and for rendering it absolutely does not for
> any rendering engine I've ever seen - anyone who has a counter example
> please tell us about it).

Larry, this is exactly the case we're speaking about.
All steps of the pipeline are computing intensive, and the time spent
on cache misses is << the time spent computing.
If we want to speak about pipeline steps that do nothing with the data but pass
it to the next step, we can stop this "thread" here, since I completely agree with
You.
Now let me explain these render steps better, to show what they _really_ do :

The triangulation step takes, say, a NURBS object that resides in memory and feeds
the cache of the processor it is running on with those entries.
It does a _computing_intensive_ operation and approximates the surface with 3D
triangles.
These 3D triangles reside in memory as well as in the cache of this processor,
which I'll call A.

Processor B will hold the Viewing Transformation task, which is a
_computing_intensive_ process. It _reads_ the 3D triangles that reside in
memory as well as in processor A's cache, feeds its own cache entries, and does
3D clipping, 2D projection and 2D viewport clipping. As a result it creates 2D
polygons ( or triangles, if another 2D triangulation step is done ) that reside
in memory as well as in processor B's cache.

Processor C will hold the Scan Conversion task, which is a _computing_intensive_
process.
It reads the 2D polygons that reside in memory as well as in processor B's cache,
feeds its own cache entries, and creates a set of scanlines that resides in
memory as well as in its own cache.

I stop here since the other steps look similar from the point of view that I
want to show.

Any write operation on an object will invalidate only two processors' caches :
the writer's ( obviously ) and that of the next pipeline step.
And since the Viewing Transformation task _only_ reads the 3D triangles == no cache
invalidation is done there.
The other processors' caches remain untouched, because a NURBS object will never
be read / written by the Scan Conversion task.
Moreover, a write of a 3D triangle by processor A will never go on to
invalidate processor C's cache.
And all the pipeline steps are very _computing_intensive_.
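
To make the read / write pattern concrete, here is a stripped-down sketch of
the data flow I mean ( the types and function names are made up and the real
math is left out; in the renderer each stage is its own process on its own
processor ) :

/*
 * Each stage only _reads_ the previous stage's output and _writes_ its
 * own buffer, so a write dirties at most the cache of the writer and of
 * the next stage that reads it.
 */
struct nurbs   { float ctrl[8][8][4]; };      /* input object              */
struct tri3d   { float v[3][3]; };            /* triangulation output (A)  */
struct poly2d  { float v[3][2]; };            /* viewing xform output (B)  */
struct scanrun { int y, x0, x1; };            /* scan conversion output (C) */

/* Processor A: reads the NURBS, writes 3D triangles. */
int triangulate(const struct nurbs *obj, struct tri3d *out, int max)
{
        /* compute-heavy surface approximation would go here */
        (void) obj; (void) out; (void) max;
        return 0;       /* number of triangles produced */
}

/* Processor B: only _reads_ A's triangles and writes its own 2D polygons,
 * so A's cache lines are shared, never invalidated by this stage. */
int viewing_xform(const struct tri3d *in, int n, struct poly2d *out)
{
        /* compute-heavy 3D clipping, projection, viewport clipping */
        (void) in; (void) n; (void) out;
        return 0;       /* number of polygons produced */
}

/* Processor C: only _reads_ B's polygons and writes its own scanlines;
 * it never touches the NURBS object or A's triangles at all. */
int scan_convert(const struct poly2d *in, int n, struct scanrun *out)
{
        /* compute-heavy rasterization */
        (void) in; (void) n; (void) out;
        return 0;       /* number of scanline runs produced */
}

int main(void)
{
        static struct nurbs   obj;
        static struct tri3d   tris[1024];
        static struct poly2d  polys[1024];
        static struct scanrun runs[4096];
        int nt, np;

        nt = triangulate(&obj, tris, 1024);
        np = viewing_xform(tris, nt, polys);
        scan_convert(polys, np, runs);
        return 0;
}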

> If your question is why are SMPs a good idea, I think the answer is that
> they provide a level of dynamic load balancing that you can't get out of
> a bunch of uniprocessors connected through a network, they also provide a
> much higher bandwidth between nodes, and they frequently have more memory.
This is exactly my view of SMP.
One of our businesses is selling architectural design apps that are targeted at
desktops.
Now I can tell a client : "Do You want a faster render ? Buy a 4-way SMP
machine !".
This can be accepted.
I cannot tell him : "Do You want a faster render ? Set up a Beowulf array in Your
showroom !".
He'd shoot me !

> What I'm less than thrilled about is people jumping in and saying "the
> scheduler is busted because it is slow when I do XYZ" and not listening
> when people who have been doing XYZ for years say that how you are doing
> XYZ is not going to work. You yourself admit that you've never tried
> this stuff on anything but a 2 processor system. I've tried stuff like
> this on everything from uniprocessors up to 2,048 processor systems.
> What's more is that I've spent a lot of time working with the hardware
> designers trying to figure out ways to support the programming model
> that you want, and the conclusion that we've all come to is that, yes,
> we can give you the ILLUSION that you have that model and as long as you
> don't really use it, it works. But if you really think you have a big
> SMP doing fine grained parallel stuff that misses the cache all the time,
> you're flat out crazy and don't understand how hardware works and what
> the limits are.

I've never put Your expertise in this field into question, and what's more, I've
never been disrespectful to the other guys here, unlike some others on this list
( Mom, look at me, the time You've spent to educate me has not gone to waste ).

>
> I'm really not excited about you pushing changes into the scheduler
> to support a model that we already know at the hardware level is
> unsupportable. Especially when those changes introduce so much as one
> cache miss, one more cache line, or one more cycle in the code path that
> we all use everyday all day long. No thanks.

Right now I'm working on modifying my cluster scheduler so that it kicks in only
under a highly loaded runqueue ( with a software-tunable trigger point ), leaving
the current implementation in place in normal cases, like a turbo that engages when
the RPM ( not the RedHat Package Manager ;) ) of the engine overtakes a given threshold.
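
Just to show the shape of the idea ( this is _not_ the actual patch;
runqueue_len, schedule_stock, schedule_cluster and the trigger value are
made-up names for the sketch ) :

extern int runqueue_len(void);          /* how many tasks are runnable      */
extern void schedule_stock(void);       /* current scheduler, left untouched */
extern void schedule_cluster(void);     /* cluster scheduling path           */

int sched_cluster_trigger = 8;          /* software-tunable trigger point    */

void schedule_dispatch(void)
{
        if (runqueue_len() > sched_cluster_trigger)
                schedule_cluster();     /* "turbo" path under heavy load     */
        else
                schedule_stock();       /* normal path in the common case    */
}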

PS:
I've printed the DOCS You pointed me to, but I've had no time to read them due to a
Saturday party.
But I'll surely read them today.

Davide.

-- 
All this stuff is IMVHO
