RE: [PATCH RFC] [X86] performance improvement for memcpy_64.S by fast string.

From: Ma, Ling
Date: Wed Nov 11 2009 - 21:13:33 EST


>-----Original Message-----
>From: H. Peter Anvin [mailto:hpa@xxxxxxxxx]
>Sent: November 12, 2009 7:21
>To: Ma, Ling
>Cc: Ingo Molnar; Ingo Molnar; Thomas Gleixner; linux-kernel
>Subject: Re: [PATCH RFC] [X86] performance improvement for memcpy_64.S by fast
>string.
>
>On 11/10/2009 11:57 PM, Ma, Ling wrote:
>> Hi Ingo
>>
>> This program is for 64bit version, so please use 'cc -o memcpy memcpy.c -O2
>-m64'
>>
>
>I did some measurements with this program; I added power-of-two
>measurements from 1-512 bytes, plus some different alignments, and found
>some very interesting results:
>
>Nehalem:
> memcpy_new is a win for 1024+ bytes, but *also* a win for 2-32
> bytes, where the old code apparently performs appallingly badly.
>
> memcpy_new loses in the 64-512 byte range, so the 1024
> threshold is probably justified.
>
>Core2:
> memcpy_new is a win for <= 512 bytes, but a loss for larger
> copies (possibly a win again for 16K+ copies, but those are
> very rare in the Linux kernel.) Surprise...
>
> However, the difference is very small.
>
>However, I had overlooked something much more fundamental about your
>patch. On Nehalem, at least *it will never get executed* (except during
>very early startup), because we replace the memcpy code with a jmp to
>memcpy_c on any CPU which has X86_FEATURE_REP_GOOD, which includes Nehalem.
>
>So the patch is a no-op on Nehalem, and any other modern CPU.

[Ma Ling]
That is good for modern CPUs. Our original intention was also to introduce movsq for Nehalem, and the method above is smarter.

>Am I guessing that the perf numbers you posted originally were all from
>your user space test program?

[Ma Ling]
Yes, they are all from this program. I'm confused that the measured values differ for only one case across multiple test runs
(at least 3 runs on my Core2 platform).
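
One way to make such run-to-run differences easier to judge is to repeat each measurement and keep the smallest total. The following is only a rough user-space sketch under stated assumptions: the helper names (now_ns, time_copy_ns, min_copy_ns) and the iteration counts are hypothetical and are not taken from the posted memcpy.c. It times the C library's memcpy, not the kernel routines, and the compiler barrier assumes GCC or Clang.

```c
/* Sketch only: now_ns(), time_copy_ns() and min_copy_ns() are hypothetical
 * helper names, not part of the original memcpy.c test program. */
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 199309L   /* for clock_gettime() under strict -std modes */
#endif
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <time.h>

/* Monotonic clock reading in nanoseconds. */
static uint64_t now_ns(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}

/* Time `iters` back-to-back copies of `len` bytes; returns the total in ns. */
static uint64_t time_copy_ns(void *dst, const void *src, size_t len, int iters)
{
    uint64_t start = now_ns();
    for (int i = 0; i < iters; i++) {
        memcpy(dst, src, len);
        /* Compiler barrier (GCC/Clang syntax) so the copy is not elided. */
        __asm__ __volatile__("" : : "r"(dst) : "memory");
    }
    return now_ns() - start;
}

/* Repeat the timing `runs` times and keep the smallest total, which is
 * less sensitive to interrupts and scheduling noise than a single run. */
static uint64_t min_copy_ns(void *dst, const void *src, size_t len,
                            int iters, int runs)
{
    assert(runs > 0);
    uint64_t best = UINT64_MAX;
    for (int r = 0; r < runs; r++) {
        uint64_t t = time_copy_ns(dst, src, len, iters);
        if (t < best)
            best = t;
    }
    return best;
}
```

Calling min_copy_ns(dst, src, len, iters, runs) once per size and alignment, and comparing the minima rather than single runs, may show whether the one differing case is noise or a real effect.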

Thanks
Ling