Re: quad ppro Compaq proliant 5000 problems with 2.1.106-ac4 was: Re: Any SMP people out there with

Robert G. Brown (rgb@phy.duke.edu)
Fri, 26 Jun 1998 10:12:24 -0400 (EDT)


On Thu, 25 Jun 1998, Joel Jaeggli wrote:

> On Thu, 25 Jun 1998, Chris Pirih wrote:
>
> > >> > Robert HYATT wrote:
> > >> > > I ran the matrix multiply benchmark... with 1, 2, 3 and 4
> > >> > > processors.
> > >> > > The numbers I got were 52 seconds, 59 seconds, 66 seconds and 73 seconds
>
> At least you know matrix multiply is totally blocked on main memory,

Actually, you mean that it is mostly UNblocked on main memory (;-).
If it were memory bound you'd get 1/4 the performance running four
jobs; the fact that each additional job costs only about 7 seconds
for the first three indicates that the scheduler is "automatically"
staggering the memory-access and CPU phases of the running programs
so that they minimally compete for the memory bus: the CPU/cache
phases of two jobs run more or less during the memory-access phase of
the third. The slow, linear increase of only seven seconds per job
suggests to me that one is seeing kernel binding, not memory bus
blocking per se -- the cost of context switches and cache flushes,
maybe? (Any ideas?) The cumulative slowdown of 21 seconds with the
fourth CPU almost certainly means that each CPU ends up partially
blocked when it tries to read from main memory while a read from
another CPU completes.
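To make the scaling argument concrete, here is a quick sketch (mine, not
part of the benchmark itself) that turns the quoted timings into per-job
slowdowns and parallel efficiencies:

```python
# Scaling arithmetic for the quoted timings: 52/59/66/73 seconds for
# 1-4 simultaneous matrix-multiply jobs (numbers from Robert Hyatt's runs).
times = {1: 52, 2: 59, 3: 66, 4: 73}  # jobs -> wall-clock seconds per job

for n, t in times.items():
    # Perfect scaling would mean each job still finishes in the 1-job time.
    efficiency = times[1] / t
    print(f"{n} jobs: {t:3d} s  slowdown {t - times[1]:2d} s  "
          f"efficiency {efficiency:.0%}")

# Full memory-bus serialization would push n jobs toward n * 52 s each;
# 73 s for four simultaneous jobs is nowhere near that worst case.
```

Even the worst case here (four jobs) runs at roughly 70% efficiency,
which is what makes the numbers interesting rather than disappointing.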

Again, this is damn good performance scaling, considerably better than
what one sees with only two PPro CPUs on a 440FX (Natoma)
motherboard. As Robert pointed out, for "many" reasonable mixes of
numerical code (ones with just a bit more CPU locality than is
possible in 40 MB matrix multiplies :-) these numbers mean that EVEN
WITH THE OLD/SLOW 66 MHz BUS four simultaneous jobs are not
significantly memory bound.
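For a rough sense of the bus budget involved, here is a back-of-the-envelope
sketch. It assumes a 64-bit data path on the 66 MHz front-side bus (typical
for PPro boards, but an assumption on my part), giving a theoretical peak of
about 528 MB/s:

```python
# Back-of-the-envelope memory-bus budget for the old 66 MHz bus.
bus_mhz = 66
bus_bytes = 8                    # assumed 64-bit data path on PPro boards
peak_mb_s = bus_mhz * bus_bytes  # ~528 MB/s theoretical peak

# Streaming a 40 MB working set once costs ~76 ms of bus time at peak.
# Jobs only collide when their streaming phases overlap, which is
# consistent with the mild per-job slowdowns observed above.
stream_ms = 40 / peak_mb_s * 1000
print(f"peak {peak_mb_s} MB/s; one 40 MB pass takes ~{stream_ms:.0f} ms")
```

The point is not the exact figure (sustained bandwidth is well below peak)
but that one streaming pass is short relative to a 52-second job, so the
jobs have plenty of room to interleave their bus usage.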

Also again, there is a distinct benchmark suite (produced by Cameron
McKinnon) that cleverly measures the SMP performance of Linux WRT
(spinlocked) kernel-bound processes. I'd really love to see this
suite repeated for a recent 2.1.x kernel -- it would be a truly great
way to demonstrate its advantage. The URL for this is:

http://www.phy.duke.edu/brahma/benchmarks.smp

and the source is embedded. If anyone does this I would be pleased to
append the results (and include it in a cleaned-up overall benchmark
page and forward the whole thing to David Mentre to include in the
Linux SMP FAQ). I'll do it myself eventually, but I'm still running
2.0.x until network performance under 2.1.x catches up (which may
already have happened, but hadn't as of 2.1.9X the last time I tried...)

rgb

Robert G. Brown http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567 Fax: 919-660-2525 email:rgb@phy.duke.edu
