Re: FPU memcpy Penalty

Craig Milo Rogers (rogers@isi.edu)
Mon, 10 Jun 96 17:19:36 PDT


> I have a Dell Dimension XPS P90 system with 16 Mbytes of
...
>I thought the Dimension was supposed to be Dell's leading edge
>system. I'm surprised it has a motherboard capable of handling a
>single SIMM.

You are right. I opened up the box, and there were two SIMMs
after all. It must've been the system I was working on this weekend
that had only one SIMM... I also found the user's manual for the
system, and it specified a 64-bit data bus. So, I now have a
performance problem with the memcpy patch on a 64-bit Pentium system.

>It might also be your video chip, BTW. It may not handle 64-bit
>transfers (over the 32-bit PCI bus) very efficiently.

Hmmm... I would guess that the X video transfers take place in
user code in the X server, and are unaffected by the FPU memcpy patch.
I assume the patch affects the transfers between the X client and
server (with no SHM extension in use), as well as lots of non-X
system calls, and that might affect the timing!
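
For what it's worth, here's a rough sketch of the kind of transfer I
mean (the names and sizes are made up, not taken from the X server):
without the SHM extension, the client pushes its data through a
socket or pipe, so every sufficiently large write()/read() crosses
the user/kernel boundary through the copy routines the patch changes.

#include <string.h>
#include <unistd.h>

/*
 * Toy stand-in for X client/server traffic without the SHM extension.
 * The 2K buffer is over the patch's 1K threshold, so the kernel-side
 * copy on each write()/read() would take the FPU copy path.
 * (Purely illustrative; the real X transport and sizes will differ.)
 */
int main(void)
{
	static char buf[2048];
	int fd[2];

	memset(buf, 0x55, sizeof(buf));
	if (pipe(fd) < 0)
		return 1;
	write(fd[1], buf, sizeof(buf));	/* user -> kernel: memcpy_fromfs() */
	read(fd[0], buf, sizeof(buf));	/* kernel -> user: memcpy_tofs() */
	return 0;
}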

>Alternatively, join pentium-memcpy@aptinc.com (send mail to
>pentium-memcpy-request@aptinc.com).

OK. I'll send future followups to that list.

> 2) The description of the patch should include words
...
>There should probably be a help description that documents this.
>However, I don't think that it's my job to find every possible Pentium
>system and run a benchmark to see what happens.

Of course not. That's our job on linux-kernel! :-) Your job
is to sort through our reports, and chide us when we don't send them,
and write up the results... :-)

...
> 4) Alternatively, perhaps the poor performance is an artifact of
> the particular program I used as a test; perhaps it does a lot
> of short memcpy calls, for which the patch has (I speculate)
> greater setup time.
>
>It actually doesn't get used unless 1K or more is being copied. It
>does have more setup time (save/restore the FPU context), but 1K is
>enough to more than amortize that out.
>
> 1) Maybe the patch's __generic_memcpy_fromfs and
> __generic_memcpy_tofs calls should be inlined (with
> non-inlined calls to __xcopy_*?)
>
>It does entirely too much (and is way too big) to be inlined. In
>fact, I don't think that memcpy (except for the very shortest constant
>memcpy's) should be inlined at all. It just wastes memory.

I was thinking that the test against the 1K limit (etc.)
could be inlined into the caller, and, with luck, optimized away in
some cases. I haven't looked very hard at the code, but it seems to
me that without the FPU memcpy patch, __generic_memcpy_fromfs and
__generic_memcpy_tofs were inlined. With the patch, they are externed.
So, what I propose is more like:

+static inline void
+___memcpy_fromfs(void * to, const void * from, size_t n)
+{
...

+static void /* Note: not inline. */
+___xcopy_fromfs (char *to, const char *from, int bytes)
+{
...

+#define ALIGNED(x, y) (!(((unsigned) (x)) & ((y) - 1)))
...

+inline void
+__generic_memcpy_fromfs(void *to, const void *from, size_t bytes)
+{
+	if (bytes == 0)
+		goto out;
+	/* Use the FPU copy only for large, well-aligned blocks. */
+	if ((bytes >= 1024) && ALIGNED(to, 8) && ALIGNED(from, 8) && ALIGNED(bytes, 256))
+		___xcopy_fromfs(to, from, bytes);
+	else
+		___memcpy_fromfs(to, from, bytes);
+ out:
+	return;
+}

Similar changes would apply for __memcpy_g and __memcpy_tofs.
Even so, the overhead of the inline code might still cause undesirable
kernel bloat.
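
To make the "optimized away" point concrete, here is a trivial,
made-up caller (none of this is from the patch): when the length is a
compile-time constant below 1K, the compiler can evaluate the inlined
size test and drop the whole FPU branch, so only the plain copy call
remains in the generated code.

#include <stddef.h>
#include <string.h>

/* Stand-ins for the real copy routines, just so this compiles alone. */
static void ___memcpy_fromfs(void *to, const void *from, size_t n)
{
	memcpy(to, from, n);		/* ordinary byte copy */
}
static void ___xcopy_fromfs(void *to, const void *from, size_t n)
{
	memcpy(to, from, n);		/* would be the FPU-assisted copy */
}

#define ALIGNED(x, y) (!(((unsigned long) (x)) & ((y) - 1)))

static inline void
__generic_memcpy_fromfs(void *to, const void *from, size_t bytes)
{
	if (bytes == 0)
		return;
	if (bytes >= 1024 && ALIGNED(to, 8) && ALIGNED(from, 8) && ALIGNED(bytes, 256))
		___xcopy_fromfs(to, from, bytes);
	else
		___memcpy_fromfs(to, from, bytes);
}

struct stat_req { char buf[64]; };	/* made-up 64-byte request */

int main(void)
{
	struct stat_req src, dst;

	memset(&src, 0x42, sizeof(src));
	/* sizeof(src) == 64 is a compile-time constant below 1K, so the
	 * ">= 1024" branch can be discarded at compile time and only the
	 * ___memcpy_fromfs() call is emitted here. */
	__generic_memcpy_fromfs(&dst, &src, sizeof(src));
	return dst.buf[0] == 0x42 ? 0 : 1;
}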

Craig Milo Rogers