RE: [PATCH RFC] [X86] performance improvement for memcpy_64.S by fast string.

From: Ma, Ling
Date: Mon Nov 09 2009 - 02:25:29 EST


Hi All

Today we ran our benchmark on Core2 and Sandy Bridge:

1. Results on Core2
Speedup on Core2
Len Alignment Speedup
1024, 0/ 0: 0.95x
2048, 0/ 0: 1.03x
3072, 0/ 0: 1.02x
4096, 0/ 0: 1.09x
5120, 0/ 0: 1.13x
6144, 0/ 0: 1.13x
7168, 0/ 0: 1.14x
8192, 0/ 0: 1.13x
9216, 0/ 0: 1.14x
10240, 0/ 0: 0.99x
11264, 0/ 0: 1.14x
12288, 0/ 0: 1.14x
13312, 0/ 0: 1.10x
14336, 0/ 0: 1.10x
15360, 0/ 0: 1.13x
Application run through perf:
for (i = 1024; i < 1024 * 16; i = i + 64)
	do_memcpy(0, 0, i);
Run the application with 'perf stat --repeat 10 ./static_orig' (and likewise ./static_new).
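For readers who want to reproduce the loop above, here is a minimal user-space sketch in C. It is not the original test program: the meaning of do_memcpy()'s arguments (source offset, destination offset, length), the buffer sizes, and the repeat parameter are assumptions, and libc memcpy() stands in for the routine being measured.

```c
#include <stddef.h>
#include <string.h>

#define MIN_LEN  1024
#define MAX_LEN  (1024 * 16)
#define STEP     64
#define SLACK    64   /* assumed headroom for nonzero offsets */

unsigned char src_buf[MAX_LEN + SLACK];
unsigned char dst_buf[MAX_LEN + SLACK];

/* Calling through a volatile pointer keeps the compiler from folding or
 * dropping the repeated copies. */
static void *(*volatile copy_fn)(void *, const void *, size_t) = memcpy;

/* Assumed signature: copy len bytes from src_buf + src_off to
 * dst_buf + dst_off. Offsets must not exceed SLACK. */
void do_memcpy(size_t src_off, size_t dst_off, size_t len)
{
    copy_fn(dst_buf + dst_off, src_buf + src_off, len);
}

/* Run the length sweep from the mail `repeat` times and return the
 * total number of bytes copied. */
unsigned long long run_sweep(unsigned int repeat)
{
    unsigned long long total = 0;

    memset(src_buf, 0x5a, sizeof(src_buf));
    for (unsigned int r = 0; r < repeat; r++) {
        for (size_t i = MIN_LEN; i < MAX_LEN; i = i + STEP) {
            do_memcpy(0, 0, i);
            total += i;
        }
    }
    return total;
}
```

Calling run_sweep() from a small main() and running the resulting binary under 'perf stat --repeat 10' should produce counters in the same format as the ones that follow. The repeat count needed to reach a run time of a few seconds depends on the machine.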
Before the patch:
Performance counter stats for './static_orig' (10 runs):

3323.041832 task-clock-msecs # 0.998 CPUs ( +- 0.016% )
22 context-switches # 0.000 M/sec ( +- 31.913% )
0 CPU-migrations # 0.000 M/sec ( +- nan% )
4428 page-faults # 0.001 M/sec ( +- 0.003% )
9921549804 cycles # 2985.683 M/sec ( +- 0.016% )
10863809359 instructions # 1.095 IPC ( +- 0.000% )
972283451 cache-references # 292.588 M/sec ( +- 0.018% )
17703 cache-misses # 0.005 M/sec ( +- 4.304% )

3.330714469 seconds time elapsed ( +- 0.021% )
After the patch:
Performance counter stats for './static_new' (10 runs):
3392.902871 task-clock-msecs # 0.998 CPUs ( +- 0.226% )
21 context-switches # 0.000 M/sec ( +- 30.982% )
0 CPU-migrations # 0.000 M/sec ( +- nan% )
4428 page-faults # 0.001 M/sec ( +- 0.003% )
10130188030 cycles # 2985.699 M/sec ( +- 0.227% )
391981414 instructions # 0.039 IPC ( +- 0.013% )
874161826 cache-references # 257.644 M/sec ( +- 3.034% )
17628 cache-misses # 0.005 M/sec ( +- 4.577% )

3.400681174 seconds time elapsed ( +- 0.219% )

2. Results on Sandy Bridge
Speedup on Sandy Bridge
Len Alignment Speedup
1024, 0/ 0: 1.08x
2048, 0/ 0: 1.42x
3072, 0/ 0: 1.51x
4096, 0/ 0: 1.63x
5120, 0/ 0: 1.67x
6144, 0/ 0: 1.72x
7168, 0/ 0: 1.75x
8192, 0/ 0: 1.77x
9216, 0/ 0: 1.80x
10240, 0/ 0: 1.80x
11264, 0/ 0: 1.82x
12288, 0/ 0: 1.85x
13312, 0/ 0: 1.85x
14336, 0/ 0: 1.88x
15360, 0/ 0: 1.88x

Application run through perf:
for (i = 1024; i < 1024 * 16; i = i + 64)
	do_memcpy(0, 0, i);
Run the application with 'perf stat --repeat 10 ./static_orig' (and likewise ./static_new).
Before the patch:
Performance counter stats for './static_orig' (10 runs):

3787.441240 task-clock-msecs # 0.995 CPUs ( +- 0.140% )
8 context-switches # 0.000 M/sec ( +- 22.602% )
0 CPU-migrations # 0.000 M/sec ( +- nan% )
4428 page-faults # 0.001 M/sec ( +- 0.003% )
6053487926 cycles # 1598.305 M/sec ( +- 0.140% )
10861025194 instructions # 1.794 IPC ( +- 0.001% )
2823963 cache-references # 0.746 M/sec ( +- 69.345% )
266000 cache-misses # 0.070 M/sec ( +- 0.980% )

3.805400837 seconds time elapsed ( +- 0.139% )
After the patch:
Performance counter stats for './static_new' (10 runs):

2879.424879 task-clock-msecs # 0.995 CPUs ( +- 0.076% )
10 context-switches # 0.000 M/sec ( +- 24.761% )
0 CPU-migrations # 0.000 M/sec ( +- nan% )
4428 page-faults # 0.002 M/sec ( +- 0.003% )
4602155158 cycles # 1598.290 M/sec ( +- 0.076% )
386146993 instructions # 0.084 IPC ( +- 0.005% )
520008 cache-references # 0.181 M/sec ( +- 8.077% )
267345 cache-misses # 0.093 M/sec ( +- 0.792% )

2.893813235 seconds time elapsed ( +- 0.085% )
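
To compare the two perf runs on each CPU directly, the counters above can be reduced to a few ratios. The sketch below only does that arithmetic; the struct and function names are made up for illustration, and the numbers are copied from the perf output in this mail.

```c
#include <stdio.h>

/* One perf stat run: values copied from the output in this mail. */
struct perf_run {
    unsigned long long cycles;
    unsigned long long instructions;
    double elapsed_sec;
};

const struct perf_run core2_before = { 9921549804ULL, 10863809359ULL, 3.330714469 };
const struct perf_run core2_after  = { 10130188030ULL, 391981414ULL, 3.400681174 };
const struct perf_run snb_before   = { 6053487926ULL, 10861025194ULL, 3.805400837 };
const struct perf_run snb_after    = { 4602155158ULL, 386146993ULL, 2.893813235 };

/* Above 1.0 means the patched binary needed fewer cycles. */
double cycle_speedup(const struct perf_run *before, const struct perf_run *after)
{
    return (double)before->cycles / (double)after->cycles;
}

/* Above 1.0 means the patched binary finished sooner. */
double time_speedup(const struct perf_run *before, const struct perf_run *after)
{
    return before->elapsed_sec / after->elapsed_sec;
}

/* How many times fewer instructions the patched binary retired. */
double instruction_ratio(const struct perf_run *before, const struct perf_run *after)
{
    return (double)before->instructions / (double)after->instructions;
}

void print_summary(const char *cpu, const struct perf_run *before,
                   const struct perf_run *after)
{
    printf("%s: cycles %.3fx, elapsed %.3fx, instructions %.1fx fewer\n",
           cpu, cycle_speedup(before, after), time_speedup(before, after),
           instruction_ratio(before, after));
}
```

For the whole sweep this works out to roughly 0.98x on Core2 and roughly 1.3x on Sandy Bridge, with about 28 times fewer instructions retired in both cases.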

Thanks
Ling

>-----Original Message-----
>From: H. Peter Anvin [mailto:hpa@xxxxxxxxx]
>Sent: November 7, 2009 3:26
>To: Ma, Ling
>Cc: mingo@xxxxxxx; tglx@xxxxxxxxxxxxx; linux-kernel@xxxxxxxxxxxxxxx
>Subject: Re: [PATCH RFC] [X86] performance improvement for memcpy_64.S by fast
>string.
>
>On 11/06/2009 09:07 AM, H. Peter Anvin wrote:
>>
>> Where did the 1024 byte threshold come from? It seems a bit high to me,
>> and is at the very best a CPU-specific tuning factor.
>>
>> Andi is of course correct that older CPUs might suffer (sadly enough),
>> which is why we'd at the very least need some idea of what the
>> performance impact on those older CPUs would look like -- at that point
>> we can make a decision to just unconditionally do the rep movs or
>> consider some system where we point at different implementations for
>> different processors -- memcpy is probably one of the very few
>> operations for which something like that would make sense.
>>
>
>To be explicit: Ling, would you be willing to run some benchmarks across
>processors to see how this performs on non-Nehalem CPUs?
>
> -hpa
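
The idea quoted above, pointing at different implementations for different processors, could look roughly like the following in user-space C. All names here are hypothetical, the CPU check is reduced to a flag passed in by the caller, and libc memcpy() stands in for a rep movs based copy; this is only a sketch of the dispatch pattern, not how the kernel selects its memcpy.

```c
#include <stddef.h>
#include <string.h>

typedef void *(*memcpy_fn)(void *dst, const void *src, size_t len);

/* Stand-in for the generic copy loop used on CPUs without fast string. */
static void *memcpy_generic(void *dst, const void *src, size_t len)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    while (len--)
        *d++ = *s++;
    return dst;
}

/* Stand-in for a "rep movs" based copy; libc memcpy() keeps the sketch
 * portable. */
static void *memcpy_fast_string(void *dst, const void *src, size_t len)
{
    return memcpy(dst, src, len);
}

/* Selected once at startup, then used for every copy. */
memcpy_fn selected_memcpy = memcpy_generic;

/* prefers_fast_string would come from a CPU feature or model check,
 * which is left out here. */
void select_memcpy(int prefers_fast_string)
{
    selected_memcpy = prefers_fast_string ? memcpy_fast_string
                                          : memcpy_generic;
}
```

Selecting once and calling through the pointer keeps the per-call cost to one indirect call, at the price of needing per-CPU benchmark data like the tables above to decide which flag to pass.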