Re: Forking (fwd)

Robert Glamm (glamm@mountains.ee.umn.edu)
Fri, 13 Jun 1997 11:06:04 -0500 (CDT)


> > > > Is it possible to create a new process which shares the data with the
> > > > parent?
> > >
> > > Yes, by specifying such an option to clone() (which fork() uses to
> > > implement its behaviour, btw). That is not meant to be used in
> > > applications, though. You should use a thread library that supports the
> > > clone() system call (such as linuxthreads).
> >
> > So why is clone() not meant to be used in applications? If I want
> > full control over how my program is parallelized across multiple processors,
> > I'm going to want to use clone(). If you claim that we shouldn't use clone()
> > for "portability" reasons, that's a bunch of BS - if you want your
> > application to all-out perform on SMP machines, you have to hand-tune
> > it for each architecture, making portability pointless anyway.
>
> Because you get the same performance/control with a threads library, but you
> get a fully portable program that way.

Wow, one would think from the above comment that the comment I made wasn't
even read...

NO, you don't get the same performance/control with a threads library.
If anyone would like to argue this point with me I will happily hand-tune
some SMP code for various architectures that will beat any
threads-library coded software. Here's why:

1) if you're coding for SMP, you want to extract _AS MUCH PERFORMANCE
AS POSSIBLE_ from the system you're running on. Otherwise, why
bother to code for SMP at all? I'm talking about apps here that really
_need_ SMP performance, not some silly little Web server process.
And if anyone gives me another RC5 cracking example, I will personally
shoot them. Why? Because RC5 is so parallel it's trivial -- it's
a simple matter to divide up the keyspace and distribute it across
multiple machines without much overhead regardless of how that
distribution is done. There are few applications that are this
data parallel.

2) Now, given that you want to extract as much performance as possible
from your SMP machine, you need to code it on a _machine by machine_
basis if you want it to run on different platforms. Thus, the portability
achieved by the threads library is pointless. Take an app coded
using the threads library and a version hand-tuned per machine: across
a bunch of different platforms, the hand-tuned versions will win hands down.
When I say `hand-tuned per platform' I mean taking _all_ of the
machine's characteristics into account, either at compile-time or at
run time: L1/L2 cache sizes, time per memory reference, average
disk access time and transfer rates (if necessary), average semaphore
contention time between processors, etc. Bottom line: for any reasonably
complex SMP problem, if you tell me that your nice portable threads
library can do better than my (assembly!) semaphore/lock code +
hand-tuned SMP code, you're out of your mind.
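
As a rough illustration of taking machine characteristics into account at
run time, here is a minimal sketch in C. It is not the hand-tuned code
referred to above. The helper names (cache_line_bytes, online_cpus,
chunk_bounds) are made up for this example, the 64-byte fallback is an
assumption, and _SC_LEVEL1_DCACHE_LINESIZE is a glibc extension that other
C libraries may not provide. The idea is to split an array among
processors so that no two processors write to the same cache line.

```c
#include <assert.h>
#include <stddef.h>
#include <unistd.h>

/* Hypothetical helper: L1 data cache line size in bytes. Falls back to
   64 (an assumption) when the system does not report a value. */
static size_t cache_line_bytes(void)
{
    long line = -1;
#ifdef _SC_LEVEL1_DCACHE_LINESIZE      /* glibc extension, not POSIX */
    line = sysconf(_SC_LEVEL1_DCACHE_LINESIZE);
#endif
    return line > 0 ? (size_t)line : 64;
}

/* Hypothetical helper: number of processors currently online (at least 1). */
static size_t online_cpus(void)
{
    long n = sysconf(_SC_NPROCESSORS_ONLN);
    return n > 0 ? (size_t)n : 1;
}

/* Give worker `cpu` (0-based) the half-open element range [*begin, *end).
   Every boundary between two workers is a multiple of the number of
   elements per cache line, so, assuming the array starts on a line
   boundary and elem_size divides the line size, no line is shared. */
static void chunk_bounds(size_t n_elems, size_t elem_size, size_t line,
                         size_t ncpu, size_t cpu,
                         size_t *begin, size_t *end)
{
    size_t per_line = line / elem_size;              /* elements per line */
    if (per_line == 0)
        per_line = 1;                                /* element > one line */
    size_t lines = (n_elems + per_line - 1) / per_line;    /* round up */
    size_t lines_per_cpu = (lines + ncpu - 1) / ncpu;      /* round up */
    size_t b = cpu * lines_per_cpu * per_line;
    size_t e = b + lines_per_cpu * per_line;
    *begin = b < n_elems ? b : n_elems;              /* clamp to the array */
    *end   = e < n_elems ? e : n_elems;
}
```

A caller would pass cache_line_bytes() and online_cpus() as the line and
ncpu arguments, then hand each range to one worker. This covers only one
of the characteristics listed above; cache sizes, memory latency and lock
contention would each need their own measurement.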

Commercial application developers must agree with my points, otherwise
why would they spend so much time and effort hand-tuning their SMP code
for so many different platforms? E.g.: Cray has an entire DIVISION
devoted to helping app developers tune their packages to particular
platforms. IBM (for their cluster & SMP machines) and SGI are both
in similar situations.

Of course, if you don't agree that you want to get as much performance
as possible from your SMP machine, then the above really doesn't apply.
But then why go through all the trouble of using SMP in the first place
if you don't care about performance?
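
For reference, the clone() option from the original question (a new
process that shares its data with the parent) can be sketched as below.
This is a minimal Linux/glibc example, not a tuned SMP program. The names
child_main, run_shared_child and CHILD_STACK_BYTES are made up for this
sketch, and the 64 KB stack size is an arbitrary assumption.

```c
#define _GNU_SOURCE                 /* needed for clone() in <sched.h> */
#include <assert.h>
#include <sched.h>
#include <signal.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>

#define CHILD_STACK_BYTES (64 * 1024)   /* arbitrary size for this sketch */

/* Because of CLONE_VM, parent and child see the same copy of this. */
static volatile int shared_counter = 0;

/* Entry point of the child: add the parent's argument to the counter. */
static int child_main(void *arg)
{
    shared_counter += *(int *)arg;
    return 0;                       /* becomes the child's exit status */
}

/* Run one child that shares the parent's address space, wait for it,
   and return the counter value afterwards (or -1 on failure). */
static int run_shared_child(int increment)
{
    char *stack = malloc(CHILD_STACK_BYTES);
    if (stack == NULL)
        return -1;

    /* The stack grows downward on most architectures, so pass its top.
       SIGCHLD in the flags lets the parent use waitpid() as with fork(). */
    pid_t pid = clone(child_main, stack + CHILD_STACK_BYTES,
                      CLONE_VM | SIGCHLD, &increment);
    if (pid == -1) {
        free(stack);
        return -1;
    }

    int status;
    if (waitpid(pid, &status, 0) == -1) {
        free(stack);
        return -1;
    }
    free(stack);                    /* safe: the child has exited */
    return shared_counter;
}
```

With plain fork() the child would modify its own copy and the parent would
still see 0. With CLONE_VM the parent sees the child's update, and any
concurrent access would need locking of its own.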

-- 
"If HP was only an $8 billion  | Bob Glamm  H: +1 612 6239437 W: +1 612 6268981 
 company like Sun, we also     | URL:        http://www-mount.ee.umn.edu/~glamm
 might be less ambitious."     +-----------------------------------------------
 -- HP's Lewis Platt referring to Sun's refusal to support Windows NT