New csum and csum_copy routines - and a test/benchmark program

From: Denis Vlasenko (vda@port.imtp.ilyichevsk.odessa.ua)
Date: Mon Oct 28 2002 - 08:12:01 EST


I took some time to develop a little test/benchmark program
for csum and csum_copy routines (used in networking).
It has grown to include following features:

* Total buffer size #define-selectable, hence you can measure
  cache-hot and cache-cold performance.

* It does not simply checksum entire buffer, you can do it in 'chunks'.
  Chunk size is a #define too. Chunk order is randomized for eash run
  (this is done to stop fooling us with prefetch from prev chunk to next).
  But you are guaranteed to walk entire buffer.

* Buffer contents are randomized at each run. Csum correctness is checked.

* Buffer copy correctness verified for csum_copy.

* You can set random (up to a #defined value) start and end offset for each
  chunk. Gaps are poisoned before each csum_copy and verified afterwards.
  This has already caught two bugs.

* It benchmarks each routine by running it #defined number of times
  and reporting min/max cycles per kb taken.

* It is easy to add/remove C and asm test routines.

* Easily adaptable for SSE and MMX instruction sets.

* It can make coffee for you. ;)

I'm thinking on how to collect 2-5 best routines and
make 'em compete at kernel init time for the right
to be used for blazing network performance, but did not
even start to code this. Similar approach can be taken
for page clear/copy and copy to/from user routines.

Election of 'best' routine by lkml posts is:
1. Slow
2. Doesn't fit given combination of CPU/mem/mobo
so do _not_ send your results to lkml unless you think
you found something interesting.

FYI, my last results below. kpf_XXX routines are newest'n'greatest.
I found out to my surprize that shortening unrolled loop
on Duron has positive effect.

Coders with 'prefetchless' CPUs are encouraged to write up
their own versions of prefetch-like routines (you may use
mov [mem],reg as a prefetch in the hopes CPU will reorder
instructions and will happily csum older data while such mov
is waiting for data to be fetched. But this needs testing.
That's what this program is for! :-)

--
vda

Duron 650 ========= Csum benchmark program buffer size: 4 Mb Each test tried 32 times, max and min CPU cycles are reported. Please disregard max values. They are due to system interference only. csum tests: kernel_csum - took 2612 max, 1887 min cycles per kb. sum=0xfad28968 kernel_csum - took 2654 max, 1887 min cycles per kb. sum=0xfad28968 kernel_csum - took 2105 max, 1887 min cycles per kb. sum=0xfad28968 kernel2_csum - took 2636 max, 1925 min cycles per kb. sum=0xfad28968 kernelpii_csum - took 11879 max, 1735 min cycles per kb. sum=0xaeffd53b kernelpiipf_csum - took 2565 max, 1642 min cycles per kb. sum=0xaeffd53b kpf_csum - took 1280 max, 1037 min cycles per kb. sum=0xaeffd53b kpf_csum - took 1298 max, 1037 min cycles per kb. sum=0xaeffd53b kpf_csum - took 1285 max, 1035 min cycles per kb. sum=0xaeffd53b kpf_csum - took 1893 max, 1037 min cycles per kb. sum=0xaeffd53b copy tests: kernel_copy - took 5812 max, 4854 min cycles per kb. sum=0xfad28968 kernel_copy - took 5741 max, 4854 min cycles per kb. sum=0xfad28968 kernel_copy - took 17680 max, 4859 min cycles per kb. sum=0xfad28968 kernelpii_copy - took 7204 max, 6381 min cycles per kb. sum=0xe3bca07e kernelpiipf_copy - took 8429 max, 7477 min cycles per kb. sum=0xe3bca07e kpf_copy - took 12806 max, 2471 min cycles per kb. sum=0xfad28968 kpf_copy - took 3181 max, 2470 min cycles per kb. sum=0xfad28968 kpf_copy - took 3327 max, 2471 min cycles per kb. sum=0xfad28968 kpf_copy - took 11967 max, 2471 min cycles per kb. sum=0xfad28968 Done

Celeron 1200 ============ Csum benchmark program buffer size: 4 Mb Each test tried 32 times, max and min CPU cycles are reported. Please disregard max values. They are due to system interference only. csum tests: kernel_csum - took 7368 max, 6833 min cycles per kb. sum=0x291132e0 kernel_csum - took 9038 max, 6845 min cycles per kb. sum=0x291132e0 kernel_csum - took 7112 max, 6836 min cycles per kb. sum=0x291132e0 kernel2_csum - took 7254 max, 6871 min cycles per kb. sum=0x291132e0 kernelpii_csum - took 4696 max, 4109 min cycles per kb. sum=0x484713aa kernelpiipf_csum - took 4715 max, 4271 min cycles per kb. sum=0x484713aa kpf_csum - took 3295 max, 2780 min cycles per kb. sum=0x484713aa kpf_csum - took 3091 max, 2793 min cycles per kb. sum=0x484713aa kpf_csum - took 14580 max, 2833 min cycles per kb. sum=0x484713aa kpf_csum - took 3292 max, 2833 min cycles per kb. sum=0x484713aa copy tests: kernel_copy - took 13927 max,13450 min cycles per kb. sum=0x291132e0 kernel_copy - took 14009 max,13406 min cycles per kb. sum=0x291132e0 kernel_copy - took 13957 max,13447 min cycles per kb. sum=0x291132e0 kernelpii_copy - took 15039 max,11335 min cycles per kb. sum=0x5474077d kernelpiipf_copy - took 14137 max,13059 min cycles per kb. sum=0x5474077d kpf_copy - took 8226 max, 7857 min cycles per kb. sum=0x291132e0 kpf_copy - took 20698 max, 7886 min cycles per kb. sum=0x291132e0 kpf_copy - took 8504 max, 7897 min cycles per kb. sum=0x291132e0 kpf_copy - took 8245 max, 7893 min cycles per kb. sum=0x291132e0 Done


- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/



This archive was generated by hypermail 2b29 : Thu Oct 31 2002 - 22:00:35 EST