Re: [PATCH RFC] [X86] performance improvement for memcpy_64.S by fast string.

From: H. Peter Anvin
Date: Wed Nov 11 2009 - 18:21:47 EST


On 11/10/2009 11:57 PM, Ma, Ling wrote:
> Hi Ingo
>
> This program is for the 64-bit version, so please use 'cc -o memcpy memcpy.c -O2 -m64'
>

I did some measurements with this program; I added power-of-two
measurements from 1-512 bytes, plus some different alignments, and found
some very interesting results:

Nehalem:
	memcpy_new is a win for 1024+ bytes, but *also* a win for 2-32
	bytes, where the old code apparently performs appallingly badly.

memcpy_new loses in the 64-512 byte range, so the 1024
threshold is probably justified.

Core2:
	memcpy_new is a win for <= 512 bytes, but a loss for larger
	copies (possibly a win again for 16K+ copies, but those are
	very rare in the Linux kernel). Surprise...

	The difference is very small, though.
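For anyone wanting to reproduce this kind of sweep, here is a minimal user-space sketch, assuming a POSIX system with clock_gettime(). It is not the test program from this thread, and the helper names (copy_is_correct, bench_copy, sweep) are hypothetical. It times a copy routine over power-of-two sizes 1-512 at a few misalignments, and calls through a volatile function pointer so that -O2 cannot inline or drop the copy.

```c
#define _POSIX_C_SOURCE 199309L /* for clock_gettime() */
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <time.h>

/* Signature shared by memcpy() and any candidate implementation. */
typedef void *(*copy_fn)(void *dst, const void *src, size_t n);

#define MAX_LEN   4096
#define MAX_ALIGN 64

/* 64-byte-aligned buffers; offsets from the base give misaligned pointers. */
static _Alignas(64) unsigned char src_buf[MAX_LEN + MAX_ALIGN];
static _Alignas(64) unsigned char dst_buf[MAX_LEN + MAX_ALIGN];
static volatile unsigned char sink; /* keeps each copy observable */

static double now_ns(void)
{
	struct timespec ts;
	clock_gettime(CLOCK_MONOTONIC, &ts);
	return (double)ts.tv_sec * 1e9 + (double)ts.tv_nsec;
}

/* Returns 1 if fn copies len bytes correctly at the given misalignment. */
static int copy_is_correct(copy_fn fn, size_t len, size_t align)
{
	assert(len <= MAX_LEN && align < MAX_ALIGN);
	for (size_t i = 0; i < len; i++)
		src_buf[align + i] = (unsigned char)(i * 7 + 1);
	memset(dst_buf, 0, sizeof(dst_buf));
	fn(dst_buf + align, src_buf + align, len);
	return memcmp(dst_buf + align, src_buf + align, len) == 0;
}

/* Average nanoseconds per call of fn for one (len, align) pair. */
static double bench_copy(copy_fn fn, size_t len, size_t align, int iters)
{
	assert(len >= 1 && len <= MAX_LEN && align < MAX_ALIGN && iters > 0);
	/* A volatile pointer hides the target, so the call is not elided. */
	copy_fn volatile f = fn;
	double start = now_ns();
	for (int i = 0; i < iters; i++) {
		f(dst_buf + align, src_buf + align, len);
		sink = dst_buf[align + len - 1];
	}
	return (now_ns() - start) / iters;
}

/* Prints timings for power-of-two sizes 1..512 at a few alignments. */
static void sweep(copy_fn fn, const char *name)
{
	static const size_t aligns[] = { 0, 1, 4, 8 };
	for (size_t len = 1; len <= 512; len *= 2)
		for (size_t a = 0; a < sizeof(aligns) / sizeof(aligns[0]); a++)
			printf("%s len=%4zu align=%zu: %.2f ns\n", name, len,
			       aligns[a], bench_copy(fn, len, aligns[a], 100000));
}
```

Calling sweep(memcpy, "libc") from a main() and then sweep() again for each candidate gives side-by-side tables. Keep in mind that numbers gathered this way only describe the user-space copies being timed, not necessarily the code the kernel ends up running.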

However, I had overlooked something much more fundamental about your
patch. On Nehalem, at least, *it will never get executed* (except during
very early startup), because we replace the memcpy code with a jmp to
memcpy_c on any CPU that has X86_FEATURE_REP_GOOD, which includes Nehalem.

So the patch is a no-op on Nehalem and on any other modern CPU with
X86_FEATURE_REP_GOOD.
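To spell out why the patched body is dead code, here is a small C model of the dispatch. It is only an illustration under stated assumptions: the kernel rewrites the start of memcpy with a jmp to memcpy_c at boot rather than going through a function pointer, and every name below (memcpy_c_model, memcpy_open_coded_model, apply_memcpy_alternative, cpu_has_rep_good) is hypothetical, with plain C loops standing in for the real assembly.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

typedef void *(*copy_fn)(void *dst, const void *src, size_t n);

/* Hypothetical stand-in for memcpy_c (the "rep movs" path). */
static void *memcpy_c_model(void *dst, const void *src, size_t n)
{
	unsigned char *d = dst;
	const unsigned char *s = src;
	while (n--)
		*d++ = *s++;
	return dst;
}

/* Hypothetical stand-in for the open-coded memcpy body the patch edits. */
static void *memcpy_open_coded_model(void *dst, const void *src, size_t n)
{
	unsigned char *d = dst;
	const unsigned char *s = src;
	size_t i = 0;
	/* 8 bytes per outer iteration, then the tail. */
	for (; i + 8 <= n; i += 8)
		for (size_t k = 0; k < 8; k++)
			d[i + k] = s[i + k];
	for (; i < n; i++)
		d[i] = s[i];
	return dst;
}

/* Before the boot-time fixup, calls reach the open-coded body. */
static copy_fn active_memcpy = memcpy_open_coded_model;

/* Models the fixup: with REP_GOOD, every later call goes to memcpy_c,
 * so changes to the open-coded body never run on such CPUs. */
static void apply_memcpy_alternative(bool cpu_has_rep_good)
{
	if (cpu_has_rep_good)
		active_memcpy = memcpy_c_model;
}
```

The only window in which the open-coded body runs on a REP_GOOD CPU is before the fixup is applied, which corresponds to the "very early startup" case.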

Am I correct in guessing that the perf numbers you originally posted were
all from your user space test program?

-hpa