Re: FPU memcpy Penalty

Robert L Krawitz (rlk@tiac.net)
Mon, 10 Jun 1996 19:29:05 -0400


Cc: torvalds@cs.helsinki.fi, linux-kernel@vger.rutgers.edu
Date: Mon, 10 Jun 96 16:00:04 PDT
From: Craig Milo Rogers <rogers@ISI.EDU>

I have a Dell Dimension XPS P90 system with 16 Mbytes of
memory. I believe the memory is in a single 72-pin SIMM, although I
didn't open the box to confirm it prior to sending this message. The
system has a #9 GXE 64 (S3-based) video board with 2 MB of memory.

If you really do have that memory configuration, your machine is
crippled anyway (poor memory bandwidth). Operating on a half-width
memory bus is a lose.

I thought the Dimension was supposed to be Dell's leading edge
system. I'm surprised it has a motherboard capable of handling a
single SIMM.

The FPU memcpy patch Web page
(http://www.tiac.net/users/rlk/linux.html) states, "It will do no good
on a crippled Pentium with a 32-bit memory bus (if your motherboard
can accept a single 72-pin SIMM)." It doesn't say, "May further
reduce performance on a crippled Pentium", so I applied the patch in
1.99.12 and above. The patch inserted smoothly, and there were no
apparent immediate adverse effects.

I'm surprised it hurts performance that badly. I would expect it to
hurt performance slightly (maybe 10%, tops, and maybe not at all).
I've never personally run on a crippled Pentium, so I didn't know that
it would clobber performance. One person did try it on a laptop with
a crippled memory system and found no change in performance.

It might also be your video chip, BTW. It may not handle 64-bit
transfers (over the 32-bit PCI bus) very efficiently.

1) I hope others will run similar tests, but I suggest that
results should be sent solely to Robert Krawitz
<rlk@tiac.net> to avoid bogging down the linux-kernel list.

Alternatively, join pentium-memcpy@aptinc.com (send mail to
pentium-memcpy-request@aptinc.com).

2) The description of the patch should include words
to the effect that it may decrease performance on
some systems. This warning should be included in future
versions of the patch itself, as well as appear on the Web page.

There should probably be a help description that documents this.
However, I don't think that it's my job to find every possible Pentium
system and run a benchmark to see what happens.

3) If my problem is due to a factor such as my machine's
bus width, perhaps the kernel (or the configuration
process) could automatically choose the best memcpy
for a particular system based on a system startup
timing test?

   1) Perhaps "make config" could run a small program to determine
      the better algorithm on a particular system (after asking
      whether to do so)?

Somebody want to write this?
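
For anyone who wants to take this up, here is a rough user-space sketch of
the kind of timing test suggested above. It is not part of the patch. The
names copy_libc, copy_bytes, time_copy and pick_copy are made up for
illustration, and the byte loop merely stands in for whatever alternative
routine is being compared. It times each candidate with clock() and returns
the faster one. A real test would need care that the compiler does not
optimize the copies away and that the buffers are large enough to reflect
memory rather than cache speed.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <time.h>

/* Common signature for the copy routines being compared. */
typedef void *(*copy_fn)(void *dst, const void *src, size_t n);

/* Candidate A: the C library's memcpy. */
static void *copy_libc(void *dst, const void *src, size_t n)
{
    return memcpy(dst, src, n);
}

/* Candidate B: a plain byte loop, standing in for an alternative routine. */
static void *copy_bytes(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    while (n--)
        *d++ = *s++;
    return dst;
}

/* Run `reps` copies of `n` bytes; return elapsed CPU time in clock ticks. */
static clock_t time_copy(copy_fn fn, void *dst, const void *src,
                         size_t n, int reps)
{
    clock_t start = clock();
    int i;

    for (i = 0; i < reps; i++)
        fn(dst, src, n);
    return clock() - start;
}

/* Return whichever candidate ran faster; a tie goes to `a`. */
static copy_fn pick_copy(copy_fn a, copy_fn b, size_t n, int reps)
{
    static unsigned char src[65536], dst[65536];
    clock_t ta, tb;

    assert(n <= sizeof src && reps > 0);
    memset(src, 0xa5, n);

    /* One untimed pass each, so neither pays the cold-cache cost alone. */
    a(dst, src, n);
    b(dst, src, n);

    ta = time_copy(a, dst, src, n, reps);
    tb = time_copy(b, dst, src, n, reps);
    return (tb < ta) ? b : a;
}
```

Something like pick_copy(copy_libc, copy_bytes, 4096, 1000) could be run
once, either from the configuration step or at startup, and the result
stored in a function pointer.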

4) Alternatively, perhaps the poor performance is an artifact of
the particular program I used as a test; perhaps it does a lot
of short memcpy calls, for which the patch has (I speculate)
greater setup time.

It actually doesn't get used unless 1K or more is being copied. It
does have more setup time (save/restore the FPU context), but 1K is
enough to more than amortize that out.
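
To illustrate the size cutoff described above, here is a minimal sketch,
not the patch's actual code. BIG_COPY_THRESHOLD, big_copy and dispatch_copy
are hypothetical names, the 1024 is taken from the "1K" figure in this
message, and big_copy just calls memcpy where the real routine would save
the FPU context, do the wide copy, and restore it. The counters exist only
so the path taken can be observed.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Assumed cutoff: the special path is used only for copies of 1K or more. */
#define BIG_COPY_THRESHOLD 1024

/* Illustrative counters recording which path each call took. */
static unsigned long small_copies, big_copies;

/* Hypothetical stand-in for the FPU copy loop; plain memcpy here. */
static void *big_copy(void *dst, const void *src, size_t n)
{
    /* The real routine would pay a fixed setup cost here
       (save FPU context), then copy, then restore. */
    big_copies++;
    return memcpy(dst, src, n);
}

/* Send short copies down the ordinary path and long ones to big_copy,
   so the fixed setup cost is only paid when it can be amortized. */
static void *dispatch_copy(void *dst, const void *src, size_t n)
{
    assert(dst != NULL && src != NULL);

    if (n < BIG_COPY_THRESHOLD) {
        small_copies++;
        return memcpy(dst, src, n);
    }
    return big_copy(dst, src, n);
}
```

With this arrangement a program that makes many short memcpy calls never
reaches the expensive path at all, so the setup cost cannot explain a
slowdown on short copies.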

   1) Maybe the patch's __generic_memcpy_fromfs and
      __generic_memcpy_tofs calls should be inlined (with
      non-inlined calls to __xcopy_*?)

It does entirely too much (and is way too big) to be inlined. In
fact, I don't think that memcpy (except for the very shortest constant
memcpy's) should be inlined at all. It just wastes memory.

-- 
Robert Krawitz <rlk@tiac.net>           http://www.tiac.net/users/rlk/

Member of the League for Programming Freedom -- mail lpf@uunet.uu.net
Tall Clubs International -- tci-request@aptinc.com or 1-800-521-2512