[PATCH] csum and csum_copy routines with boot-time selection .D

From: Denis Vlasenko (vda@port.imtp.ilyichevsk.odessa.ua)
Date: Sat Nov 02 2002 - 11:51:08 EST


Here is a kernel patch which replaces compile-time selection
of csum and csum_copy routines with boot-time selection.
This allows the kernel to use the fastest routines _for the given CPU_,
regardless of compile-time CPU selection! You can compile for 486
and yet use faster P4-optimized code.
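
For illustration only, boot-time selection of this kind could look roughly
like the userspace sketch below. It is not the code from the patch: every
name in it (csum_candidate, CAP_MMX, CAP_SSE, csum_ref, csum_fast_stub,
csum_select) is hypothetical, the "fast" routine is just a stand-in, and
clock() replaces whatever tick counting the kernel code really does. The
idea is to skip routines whose required CPU features are missing, time the
rest on the same buffer, and keep the fastest.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <time.h>

/* All names are hypothetical; they are not the patch's identifiers. */
typedef uint32_t (*csum_fn)(const unsigned char *buf, size_t len, uint32_t sum);

#define CAP_MMX 0x1u   /* assumed feature bits, for illustration only */
#define CAP_SSE 0x2u

struct csum_candidate {
    const char *name;
    unsigned need_caps;   /* CPU features the routine requires */
    csum_fn fn;
};

/* Portable routine: sums 16-bit little-endian words with end-around carry. */
uint32_t csum_ref(const unsigned char *buf, size_t len, uint32_t sum)
{
    size_t i;

    for (i = 0; i + 1 < len; i += 2) {
        uint32_t w = (uint32_t)buf[i] | ((uint32_t)buf[i + 1] << 8);
        sum += w;
        sum += (sum < w);          /* add the carry back in */
    }
    if (i < len) {                 /* trailing odd byte */
        sum += buf[i];
        sum += (sum < buf[i]);
    }
    return sum;
}

/* Stand-in for an MMX/SSE routine; a real one would be written in asm. */
uint32_t csum_fast_stub(const unsigned char *buf, size_t len, uint32_t sum)
{
    return csum_ref(buf, len, sum);
}

/* Returns the fastest usable candidate, or NULL if none is usable. */
const struct csum_candidate *
csum_select(const struct csum_candidate *tab, size_t n, unsigned cpu_caps,
            const unsigned char *buf, size_t len, int reps)
{
    const struct csum_candidate *best = NULL;
    clock_t best_ticks = 0;
    volatile uint32_t sink = 0;    /* keeps the timed calls from being elided */
    size_t i;
    int r;

    assert(tab != NULL || n == 0);
    for (i = 0; i < n; i++) {
        clock_t t0, dt;

        if (tab[i].need_caps & ~cpu_caps)
            continue;              /* skipped: not supported by CPU */
        t0 = clock();
        for (r = 0; r < reps; r++)
            sink = sink ^ tab[i].fn(buf, len, 0);
        dt = clock() - t0;
        if (best == NULL || dt < best_ticks) {
            best = &tab[i];
            best_ticks = dt;
        }
    }
    (void)sink;
    return best;
}
```

The result of csum_select() would be stored once in a function pointer at
boot and used from then on. Timing with clock() is coarse, so a real
benchmark needs enough repetitions for the differences to be measurable.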

Theory of operation: inside csum_partial() I check the buffer size
and, if it is large enough, I call the optimized routine.
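
A minimal sketch of that size check, in plain userspace C rather than the
kernel code. The names csum_partial_sketch, best_csum, csum_basic,
csum_big_stub and CSUM_BIG_THRESHOLD are made up for this example. The
128-byte threshold is the one mentioned in this message; everything else is
an assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical names throughout; the patch's real identifiers may differ. */
typedef uint32_t (*csum_fn)(const unsigned char *buf, size_t len, uint32_t sum);

#define CSUM_BIG_THRESHOLD 128     /* bytes; larger blocks go to the fast path */

/* 32-bit add with end-around carry, as one's complement sums require. */
static uint32_t add32_carry(uint32_t a, uint32_t b)
{
    uint32_t s = a + b;
    return s + (s < a);
}

/* Plain routine, always safe to call: sums 16-bit little-endian words. */
uint32_t csum_basic(const unsigned char *buf, size_t len, uint32_t sum)
{
    size_t i;

    for (i = 0; i + 1 < len; i += 2)
        sum = add32_carry(sum, (uint32_t)buf[i] | ((uint32_t)buf[i + 1] << 8));
    if (i < len)                   /* trailing odd byte */
        sum = add32_carry(sum, buf[i]);
    return sum;
}

/* Stand-in for an optimized routine; counts calls so dispatch is visible. */
unsigned long csum_big_calls;

uint32_t csum_big_stub(const unsigned char *buf, size_t len, uint32_t sum)
{
    csum_big_calls++;
    return csum_basic(buf, len, sum);
}

/* Chosen once at boot; defaults to the plain routine. */
csum_fn best_csum = csum_basic;

/* Driver: small blocks take the plain path, large ones the selected routine. */
uint32_t csum_partial_sketch(const unsigned char *buf, size_t len, uint32_t sum)
{
    assert(buf != NULL || len == 0);
    if (len > CSUM_BIG_THRESHOLD)
        return best_csum(buf, len, sum);
    return csum_basic(buf, len, sum);
}
```

With best_csum pointing at csum_big_stub, a 128-byte buffer still goes
through csum_basic and a 129-byte buffer goes through the stub, so small
blocks never pay for the optimized routine's setup.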

* Rigorous testsuite catches maybe 95% of all possible bugs.
  It checks misaligned, zero, very small, odd sized buffers etc.
  It compares results with known-good ones.
  If bug is caught, testsuite prints verbose debug info.
  This helps to understand and fix bug quickly.

* Testsuite is combined with benchmarking tool.

* Code is split into 'driver' and 'optimized' routines.
  Driver routine checks and handles alignment issues and corner cases.
  This reduces possibility of subtle bugs and makes optimized routines
  simpler and easier to code/understand/bugcheck.

* Source files for the driver and optimized routines are the same
  for kernel and testsuite. This reduces chances of cut'n'paste
  and similar silly bugs.

* Optimized routines are called only if block is large enough (>128 bytes).
  This avoids MMX/SSE register save/restore overhead for small blocks.
  In other words, you never get slower copy than with old kernels
  (speed is the same for small blocks) but large blocks will be handled
  much faster.
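
To give an idea of the kind of checking listed above, here is a rough
userspace sketch, not the attached testsuite itself. It runs a routine over
every start offset 0..7 and every length up to a limit, and compares the
folded 16-bit result with a slow known-good routine. All names
(csum_selftest, csum_known_good, csum_words, csum_drops_tail, csum_fold) are
invented for this example, and csum_drops_tail is deliberately buggy to show
that the check catches it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical names; this is not the attached testsuite. */
typedef uint32_t (*csum_fn)(const unsigned char *buf, size_t len, uint32_t sum);

/* Fold a 32-bit partial sum down to 16 bits. */
static uint16_t csum_fold(uint32_t sum)
{
    sum = (sum & 0xffff) + (sum >> 16);
    sum = (sum & 0xffff) + (sum >> 16);
    return (uint16_t)sum;
}

/* Known-good routine: one byte at a time, folding as it goes. */
uint32_t csum_known_good(const unsigned char *buf, size_t len, uint32_t sum)
{
    size_t i;

    sum = (sum & 0xffff) + (sum >> 16);
    for (i = 0; i < len; i++) {
        sum += (i & 1) ? (uint32_t)buf[i] << 8 : buf[i];
        sum = (sum & 0xffff) + (sum >> 16);
    }
    return sum;
}

/* Routine under test: 16-bit little-endian words with end-around carry. */
uint32_t csum_words(const unsigned char *buf, size_t len, uint32_t sum)
{
    size_t i;

    for (i = 0; i + 1 < len; i += 2) {
        uint32_t w = (uint32_t)buf[i] | ((uint32_t)buf[i + 1] << 8);
        sum += w;
        sum += (sum < w);
    }
    if (i < len) {                 /* trailing odd byte */
        sum += buf[i];
        sum += (sum < buf[i]);
    }
    return sum;
}

/* Deliberately buggy: forgets the trailing byte of odd-sized buffers. */
uint32_t csum_drops_tail(const unsigned char *buf, size_t len, uint32_t sum)
{
    return csum_words(buf, len & ~(size_t)1, sum);
}

/* Returns the number of (offset, length) pairs where fn disagrees with good. */
int csum_selftest(csum_fn fn, csum_fn good, size_t max_len)
{
    static unsigned char pool[8 + 512];
    int failures = 0;
    size_t i, off, len;

    assert(fn != NULL && good != NULL);
    if (max_len > 512)
        max_len = 512;
    for (i = 0; i < sizeof pool; i++)
        pool[i] = (unsigned char)(i * 31 + 7);

    for (off = 0; off < 8; off++) {            /* misaligned starts */
        for (len = 0; len <= max_len; len++) { /* zero, tiny, odd sizes */
            uint16_t want = csum_fold(good(pool + off, len, 0));
            uint16_t got = csum_fold(fn(pool + off, len, 0));

            if (want != got) {
                if (failures < 5)              /* verbose, but bounded */
                    fprintf(stderr, "csum mismatch: off=%lu len=%lu "
                            "want=%04x got=%04x\n", (unsigned long)off,
                            (unsigned long)len, (unsigned)want, (unsigned)got);
                failures++;
            }
        }
    }
    return failures;
}
```

Comparing folded sums rather than raw 32-bit values matters here, because
two correct routines can return different 32-bit partial sums that fold to
the same 16-bit checksum.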

Some thoughts:

+/* Set this to value bigger than cache(s) */
+#define bufshift 20 /* 10=1kb,20=1Mb etc */
+#define chunksz (4*1024)

How can this be picked according to cache size
(I mean reliably, even for future CPUs)?

+ char *buffer = (char *) __get_free_pages(GFP_KERNEL,
+ (bufshift-PAGE_SHIFT));

hmmm... using __xx func... what is 'official' way to do it?

+#define ROUND(x) \
+SRC( movq x(%esi), %mm0 ); \
+ adcl x(%esi), %eax ; \
+ adcl x+4(%esi), %eax ; \
+DST( movntq %mm0, x(%edi) );
+// we don't need SRC() around 2nd and 3rd commands
+// (exception, if any, would be catched by 1st one)
+// (FIXME: can races against interrupts bite us?)

So, can page disappear under us here or not?

Currently the code is very verbose at boot time and runs too many
benchmark repetitions, for debugging purposes.

I don't know of any bugs in the code, but they can be there.
No warranties. I gave it some testing, it didn't eat
my disk and didn't drink my coffee :)
I am writing this while running the patched kernel on a fully
NFS-mounted filesystem, so it must be usable ;)

My dmesg snippet is shown below. Patch and test suite are attached
(bzipped).

--
vda

dmesg
=====
Linux version 2.4.20-pre11csum_t (root@localhost) (gcc version 3.2) #5 SMP Sat Nov 2 11:53:37 GMT+2 2002
....
....
IP-Config: Got DHCP answer from 255.255.255.255, my address is 172.16.42.177
IP-Config: Complete:
      device=eth0, addr=172.16.42.177, mask=255.255.255.0, gw=172.16.42.98,
     host=(none), domain=, nis-domain=(none),
     bootserver=255.255.255.255, rootserver=172.16.42.75, rootpath=
NET4: Unix domain sockets 1.0/SMP for Linux NET4.0.
Measuring network checksumming speed
allocated 256 pages
count = 2
count = 2
count = 2
count = 3
count = 3
count = 3
count = 6
count = 6
count = 6
count = 11
count = 11
count = 11
count = 21
count = 21
count = 21
test run will last 16 ticks
count = 54
count = 54
count = 54
basic : 345.600 MB/sec
count = 29
count = 29
count = 29
simple : 185.600 MB/sec
unsupported caps: 63
func 3Dnow! skipped: not supported by CPU
unsupported caps: 54
func SSE/MMX+ skipped: not supported by CPU
count = 108
count = 105
count = 107
SSE/MMX+ : 691.200 MB/sec
csum: using csum function: SSE/MMX+
count = 21
count = 21
count = 21
basic : 134.400 MB/sec
count = 17
count = 17
count = 17
simple : 108.800 MB/sec
unsupported caps: 54
func SSE/MMX+ skipped: not supported by CPU
count = 39
count = 39
count = 39
SSE/MMX+ : 249.600 MB/sec
count = 39
count = 39
count = 39
SSE : 249.600 MB/sec
csum: using csum_copy function: SSE/MMX+
freed 256 pages
Looking up port of RPC 100003/2 on 172.16.42.75
Looking up port of RPC 100005/1 on 172.16.42.75
VFS: Mounted root (nfs filesystem).
Mounted devfs on /dev
Freeing unused kernel memory: 140k freed


