Here it definitely looks like the effect of the Pentium memcpy() code: the
file re-read speed of your 2.0.23 is better than on 2.1.6, and is actually
better than the libc bcopy.
I'd have expected the same improvement in the pipe throughput too, but it
seems the overhead of context-switching the FPU state might impact
the pipe throughput negatively (wild hand-waving here ;)
The Pentium memcpy code doesn't kick in until the copy size reaches 1K;
my measurements suggest that it's actually a net win even at
512 bytes. What I suspect is happening is that the pipe buffers fit
in cache. On my system, the FPU copy does better than rep movsd when
the data is in L2 cache, but not when it's in L1. I have an async cache;
it's quite possible that a pipeline-burst cache behaves differently.
Upshot: the FPU memcpy generally does relatively poorly when the
destination is already in cache, since the FPU instructions themselves
are slow.
-- Robert Krawitz <email@example.com> http://www.tiac.net/users/rlk/
Member of the League for Programming Freedom -- mail firstname.lastname@example.org
Tall Clubs International -- email@example.com or 1-800-521-2512