Re: Csum and csum copy routines benchmark

From: Momchil Velikov (velco@fadata.bg)
Date: Fri Oct 25 2002 - 04:47:05 EST


>>>>> "Denis" == Denis Vlasenko <vda@port.imtp.ilyichevsk.odessa.ua> writes:

Denis> [please drop libc from CC:]
Denis> On 25 October 2002 05:48, Momchil Velikov wrote:
>>> Short conclusion:
>>> 1. It is possible to speed up csum routines for AMD processors
>>> by 30%.
>>> 2. It is possible to speed up csum_copy routines for both AMD
>>> and Intel three times or more.

>> Additional data point:
>>
>> Short summary:
>> 1. Checksum - kernelpii_csum is ~19% faster
>> 2. Copy - kernelpii_csum is ~6% faster
>>
>> Dual Pentium III, 1266MHz, 512K cache, 2G SDRAM (133MHz, ECC)
>>
>> The only changes I made were to decrease the buffer size to 1K (as I
>> think this is more representative of a network packet size, correct
>> me if I'm wrong) and increase the runs to 1024. Max values are
>> worthless indeed.

Denis> Well, that makes it run entirely in L0 cache. This is unrealistic
Denis> for actual use. movntq is x3 faster when you hit RAM instead of L0.

Oops ...

Denis> You need to be more clever than that - generate pseudo-random
Denis> offsets in large buffer and run on ~1K pieces of that buffer.

Here it is:

Csum benchmark program
buffer size: 1 K
Each test tried 1024 times, max and min CPU cycles are reported.
Please disregard max values. They are due to system interference only.
csum tests:
                     kernel_csum - took 8678 max, 808 min cycles per kb. sum=0x400270e8
                     kernel_csum - took 941 max, 808 min cycles per kb. sum=0x400270e8
                     kernel_csum - took 11604 max, 808 min cycles per kb. sum=0x400270e8
                  kernelpii_csum - took 28839 max, 664 min cycles per kb. sum=0x400270e8
                kernelpiipf_csum - took 9163 max, 665 min cycles per kb. sum=0x400270e8
                        pfm_csum - took 2788 max, 1470 min cycles per kb. sum=0x400270e8
                       pfm2_csum - took 1179 max, 915 min cycles per kb. sum=0x400270e8
copy tests:
                     kernel_copy - took 688 max, 263 min cycles per kb. sum=0x400270e8
                     kernel_copy - took 456 max, 263 min cycles per kb. sum=0x400270e8
                     kernel_copy - took 11241 max, 263 min cycles per kb. sum=0x400270e8
                  kernelpii_copy - took 7635 max, 246 min cycles per kb. sum=0x400270e8
                      ntqpf_copy - took 5349 max, 536 min cycles per kb. sum=0x400270e8
                     ntqpfm_copy - took 769 max, 425 min cycles per kb. sum=0x400270e8
                        ntq_copy - took 672 max, 469 min cycles per kb. sum=0x400270e8
                     ntqpf2_copy - took 8000 max, 579 min cycles per kb. sum=0x400270e8
Done
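
For comparing the min figures above, the relative difference against the
plain kernel routine is simple arithmetic. A minimal sketch (the helper
names are purely illustrative and not part of the benchmark program):

```c
#include <stdio.h>

/* Percentage by which `other` needs fewer cycles than `base`.
 * Negative when `other` is slower. Illustrative helper only. */
static double pct_fewer_cycles(unsigned base, unsigned other)
{
    return 100.0 * ((double)base - (double)other) / (double)base;
}

/* Print the comparison for some of the min figures reported above. */
static void print_comparison(void)
{
    printf("csum: kernelpii vs kernel: %.1f%% fewer cycles\n",
           pct_fewer_cycles(808, 664));
    printf("copy: kernelpii vs kernel: %.1f%% fewer cycles\n",
           pct_fewer_cycles(263, 246));
    printf("copy: ntq vs kernel:       %.1f%% fewer cycles\n",
           pct_fewer_cycles(263, 469));
}
```

With these numbers kernelpii_csum needs roughly 18% fewer cycles than
kernel_csum and kernelpii_copy roughly 6% fewer than kernel_copy, while
the ntq variants come out slower in this run.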

Ran on a 512K buffer (my cache size), choosing a 1K piece each time.
(Making the buffer larger (2M, 4M) does not make any difference.)
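
For readers without the attachment, the idea of picking a random 1K
piece per run looks roughly like the sketch below. This is not the
attached 0main.c: simple_csum is a hypothetical stand-in for the routine
under test, next_rand is a toy generator, and clock_gettime (nanoseconds)
is used in place of the CPU cycle counter so that the sketch stays
portable.

```c
#define _POSIX_C_SOURCE 199309L
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

#define BUF_SIZE   (512 * 1024)  /* large buffer, as in the run above */
#define PIECE_SIZE 1024          /* ~1K piece per run */

struct bench_result { uint64_t min_ns, max_ns; uint32_t sum; };

/* Hypothetical routine under test: folded 16-bit sum, NOT the kernel's
 * csum_partial. Assumes len is small enough not to overflow 32 bits. */
static uint32_t simple_csum(const unsigned char *p, size_t len)
{
    uint32_t sum = 0;
    size_t i;

    for (i = 0; i + 1 < len; i += 2)
        sum += (uint32_t)p[i] | ((uint32_t)p[i + 1] << 8);
    if (i < len)
        sum += p[i];
    while (sum >> 16)
        sum = (sum & 0xffff) + (sum >> 16);
    return sum;
}

/* Toy linear congruential generator; good enough for picking offsets. */
static uint32_t rng_state = 12345;
static uint32_t next_rand(void)
{
    rng_state = rng_state * 1664525u + 1013904223u;
    return rng_state >> 8;
}

/* Monotonic time in nanoseconds; stands in for the cycle counter. */
static uint64_t now_ns(void)
{
    struct timespec ts;

    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000u + (uint64_t)ts.tv_nsec;
}

/* Run the routine `runs` times, each time on a piece at a pseudo-random
 * offset inside buf, and record the min and max time of a single run. */
static struct bench_result bench_random_pieces(const unsigned char *buf,
                                               size_t buf_size,
                                               size_t piece_size, int runs)
{
    struct bench_result r = { UINT64_MAX, 0, 0 };
    size_t range;
    int i;

    assert(buf_size >= piece_size);
    range = buf_size - piece_size + 1;  /* number of valid start offsets */

    for (i = 0; i < runs; i++) {
        size_t off = next_rand() % range;
        uint64_t t0 = now_ns();
        uint32_t s = simple_csum(buf + off, piece_size);
        uint64_t dt = now_ns() - t0;

        if (dt < r.min_ns) r.min_ns = dt;
        if (dt > r.max_ns) r.max_ns = dt;
        r.sum ^= s;  /* keep the result live so the call is not dropped */
    }
    return r;
}
```

A caller would allocate BUF_SIZE bytes, fill them, and call
bench_random_pieces(buf, BUF_SIZE, PIECE_SIZE, 1024). As with the
figures above, only the min value is meaningful.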

And the modified 0main.c is attached.

~velco





