Re: faster strcpy()

Richard B. Johnson (root@chaos.analogic.com)
Sun, 26 Apr 1998 00:39:00 -0400 (EDT)


On Sat, 25 Apr 1998, Tim Smith wrote:

> At 08:22 AM 4/24/98 -0400, Richard B. Johnson wrote:
> >> #define strcpy(a,b) (char *)memcpy(a,b,strlen(b))
> ...
> >Could you please describe how you measured its speed? You actually
> >more than doubled the number of CPU instructions that need to be
> >executed to perform this function.
>
> You aren't counting right. The standard C while loop version does
> ~N reads and N writes, where N is the length of the string, plus N
> tests and N jumps.
>
> A strlen will do ~N reads, N tests, and N jumps. Even a slightly
> optimized memcpy (no tricks with floating point registers or anything
> like that, just longword moving) will do ~N/4 reads, N/4 writes, N/4
> tests, and N/4 jumps. Total reads, tests, and jumps are up by 25%,
> but writes are down by 75%. It is nowhere near double the number
> of instructions. If the memcpy unrolls the loop, the strlen + memcpy
> version is likely to use less instructions.
>
The 'standard C while loop' means absolutely nothing. There is no such
thing. The processor executes an instruction stream. It does not
execute 'C' code.

As previously shown in assembly code, obtaining the length requires
that the string be read.

Copying the string then requires that the string be read again.

While copying the string using memcpy(), a loop count must be
tested. While copying directly, a byte must be tested. This is
essentially a wash.

A simple test program, previously posted, that uses both methods,
verifies my claims.
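
The test program mentioned above is not reproduced in this message. As a stand-in, here is a minimal sketch of how the two methods could be timed against each other in C. The names copy_single_pass, copy_two_pass and time_copy are hypothetical and are not taken from the earlier post. Note the + 1 in the two-pass version: the quoted macro passes strlen(b) alone, which leaves the terminating NUL uncopied.

```c
#include <stddef.h>
#include <string.h>
#include <time.h>

/* Common signature so both methods can be timed by one routine. */
typedef char *(*copy_fn)(char *dst, const char *src);

/* Single pass over the source: the library strcpy(). */
static char *copy_single_pass(char *dst, const char *src)
{
    return strcpy(dst, src);
}

/* Two passes over the source: strlen() to measure, memcpy() to copy.
 * The + 1 carries the terminating NUL across. */
static char *copy_two_pass(char *dst, const char *src)
{
    return (char *)memcpy(dst, src, strlen(src) + 1);
}

/* Call fn(dst, src) reps times and return the elapsed CPU seconds.
 * The volatile sink is there so the loop has an observable effect. */
static double time_copy(copy_fn fn, char *dst, const char *src, long reps)
{
    volatile char sink = 0;
    clock_t start = clock();

    for (long i = 0; i < reps; i++)
        sink ^= fn(dst, src)[0];

    clock_t stop = clock();
    (void)sink;
    return (double)(stop - start) / CLOCKS_PER_SEC;
}
```

Calling time_copy(copy_single_pass, dst, src, reps) and time_copy(copy_two_pass, dst, src, reps) with the same buffers and repetition count gives two figures to compare. The result depends on the compiler, the C library and the processor, so the numbers only apply to the machine they were measured on.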

As previously posted, the simplest string copy is not the most efficient;
however, it will serve to show the point.

Simple string copy guaranteed to work (not very efficient).

        mov     esi,offset source       ; 4 clocks
        mov     edi,offset destination  ; 4 clocks
cpy:    lodsb                           ; 6 clocks
        stosb                           ; 6 clocks
        or      al,al                   ; 2 clocks
        jnz     cpy                     ; 2 to many clocks, depends upon
                                        ; the cache.

Simple strlen, guaranteed to work (not the most efficient).

        mov     esi,offset source       ; 4 clocks
        mov     edx,esi                 ; 2 clocks
        xor     al,al                   ; 2 clocks
len:    lodsb                           ; 6 clocks
        or      al,al                   ; 2 clocks
        jnz     len                     ; 2 to many clocks.
        mov     eax,esi                 ; 2 clocks
        sub     eax,edx                 ; 2 clocks
                                        ; Length in eax, counting the
                                        ; terminating NUL (esi ends one
                                        ; byte past it).

Simple memcpy, guaranteed to work (not the most efficient)

        mov     esi,offset source       ; 4 clocks
        mov     edi,offset destination  ; 4 clocks
        mov     ecx,dword ptr [count]   ; 6 clocks
        shr     ecx,1                   ; 2 clocks
        rep     movsw                   ; 6 * number of words
        adc     ecx,ecx                 ; 2 clocks
        rep     movsb                   ; 6 * number of bytes
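
For readers who do not read x86 assembly, the three routines above can be sketched in C roughly as follows. This is an illustration only: the names byte_strcpy, byte_strlen and word_memcpy are hypothetical, and a compiler will not necessarily emit the instruction sequences shown above. The strlen listing leaves its pointer one byte past the terminator, so the C version subtracts one to match the value the standard strlen() returns.

```c
#include <stddef.h>

/* Mirrors the lodsb/stosb/or/jnz loop: copy a byte, then test it,
 * so the terminating NUL is copied before the loop exits. */
static char *byte_strcpy(char *dst, const char *src)
{
    char *d = dst;
    char c;

    do {
        c = *src++;
        *d++ = c;
    } while (c != '\0');
    return dst;
}

/* Mirrors the lodsb/or/jnz loop: read bytes until a NUL is seen.
 * The pointer ends one past the NUL, hence the final - 1. */
static size_t byte_strlen(const char *src)
{
    const char *p = src;

    while (*p++ != '\0')
        ;
    return (size_t)(p - src) - 1;
}

/* Mirrors shr / rep movsw / adc / rep movsb: copy count/2 two-byte
 * units, then one trailing byte if count is odd. */
static void *word_memcpy(void *dst, const void *src, size_t count)
{
    unsigned char *d = dst;
    const unsigned char *s = src;
    size_t words = count >> 1;   /* shr ecx,1 */
    size_t odd = count & 1;      /* the bit shifted out into carry */

    while (words--) {
        d[0] = s[0];
        d[1] = s[1];
        d += 2;
        s += 2;
    }
    if (odd)
        *d = *s;
    return dst;
}
```

With these, word_memcpy(dst, src, byte_strlen(src) + 1) produces the same bytes in dst as byte_strcpy(dst, src).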

Now, if you add up the clocks for strlen() and the clocks for
memcpy(), you can compare them to the clocks for strcpy().
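
The addition can be written out as a small C sketch that uses only the clock figures given in the listings above. Two things here are assumptions and not part of the post: the function names, and the choice to treat the cost of jnz as a single parameter charged on every pass through the loop, since the listings give it only as "2 to many". Here n is the number of bytes processed, including the terminating NUL.

```c
/* Clocks for the byte-at-a-time string copy listing. */
static long strcpy_clocks(long n, long jnz)
{
    long setup = 4 + 4;               /* two mov reg,offset      */
    long per_byte = 6 + 6 + 2 + jnz;  /* lodsb, stosb, or, jnz   */

    return setup + n * per_byte;
}

/* Clocks for the strlen listing. */
static long strlen_clocks(long n, long jnz)
{
    long setup = 4 + 2 + 2;           /* mov, mov, xor           */
    long per_byte = 6 + 2 + jnz;      /* lodsb, or, jnz          */
    long tail = 2 + 2;                /* mov, sub                */

    return setup + n * per_byte + tail;
}

/* Clocks for the memcpy listing. */
static long memcpy_clocks(long count)
{
    long setup = 4 + 4 + 6;           /* mov, mov, mov from mem  */
    long words = count / 2;           /* units moved by movsw    */
    long bytes = count % 2;           /* units moved by movsb    */

    return setup + 2 + 6 * words + 2 + 6 * bytes;  /* shr, adc   */
}
```

The comparison is then strlen_clocks(n, jnz) + memcpy_clocks(n) against strcpy_clocks(n, jnz) for whatever string length and branch cost apply. The figures are nominal instruction timings; as the listing notes, the real cost of the branch depends upon the cache.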

I do this exact kind of analysis and work for a living and I am
very good at it.

Cheers,
Dick Johnson
***** FILE SYSTEM MODIFIED *****
Penguin : Linux version 2.1.92 on an i586 machine (66.15 BogoMips).
Warning : It's hard to remain at the trailing edge of technology.
