Re: [PATCH] x86: Run checksumming in parallel across multiple alu's

From: Eric Dumazet
Date: Mon Oct 21 2013 - 15:44:13 EST


On Mon, 2013-10-21 at 15:21 -0400, Neil Horman wrote:

>
> Ok, so I ran the above code on a single cpu using taskset, and set irq affinity
> such that no interrupts (save for local ones), would occur on that cpu. Note
> that I had to convert csum_partial_opt to csum_partial, as the _opt variant
> doesn't exist in my tree, nor do I see it in any upstream tree or in the history
> anywhere.

This csum_partial_opt() was a private implementation of csum_partial()
so that I could load the module without rebooting the kernel ;)

>
> base results:
> 53569916
> 43506025
> 43476542
> 44048436
> 45048042
> 48550429
> 53925556
> 53927374
> 53489708
> 53003915
>
> AVG = 492 ns
>
> prefetching only:
> 53279213
> 45518140
> 49585388
> 53176179
> 44071822
> 43588822
> 44086546
> 47507065
> 53646812
> 54469118
>
> AVG = 488 ns
>
>
> parallel alu's only:
> 46226844
> 44458101
> 46803498
> 45060002
> 46187624
> 37542946
> 45632866
> 46275249
> 45031141
> 46281204
>
> AVG = 449 ns
>
>
> both optimizations:
> 45708837
> 45631124
> 45697135
> 45647011
> 45036679
> 39418544
> 44481577
> 46820868
> 44496471
> 35523928
>
> AVG = 438 ns
>
>
> We continue to see a small savings in execution time with prefetching (4 ns, or
> about 0.8%), a better savings with parallel alu execution (43 ns, or 8.7%), and
> the best savings with both optimizations (54 ns, or 10.9%).
>
> These results, while they've changed as we've modified the test case slightly,
> have remained consistent in their speedup ordering. Prefetching helps, but
> not as much as using multiple alu's, and neither is as good as doing both
> together.
>
> Unless you see something else that I'm doing wrong here, it seems like a win to
> do both.
>
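
For anyone who wants to re-check the arithmetic in the quoted figures, here is a
minimal sketch in Python. It assumes each raw number is the total time in
nanoseconds for 100000 iterations of the test loop; that count is not stated in
this mail and is only inferred from the totals and the quoted averages. The
names ITERATIONS, runs and avg_ns are made up for this illustration.

```python
# Assumption: each raw figure is total ns for 100000 iterations
# (inferred from the totals vs. the quoted averages, not stated in the mail).
ITERATIONS = 100_000

# Raw per-run totals copied from the quoted results.
runs = {
    "base": [53569916, 43506025, 43476542, 44048436, 45048042,
             48550429, 53925556, 53927374, 53489708, 53003915],
    "prefetch": [53279213, 45518140, 49585388, 53176179, 44071822,
                 43588822, 44086546, 47507065, 53646812, 54469118],
    "parallel": [46226844, 44458101, 46803498, 45060002, 46187624,
                 37542946, 45632866, 46275249, 45031141, 46281204],
    "both": [45708837, 45631124, 45697135, 45647011, 45036679,
             39418544, 44481577, 46820868, 44496471, 35523928],
}


def avg_ns(samples):
    """Mean time per iteration in ns, averaged over all runs."""
    return sum(samples) / len(samples) / ITERATIONS


base = avg_ns(runs["base"])
for name, samples in runs.items():
    avg = avg_ns(samples)
    saved = base - avg
    # Also print the min/max per-iteration time to show run-to-run spread.
    lo = min(samples) / ITERATIONS
    hi = max(samples) / ITERATIONS
    print(f"{name:9s} avg {avg:6.1f} ns  range {lo:5.1f}-{hi:5.1f} ns  "
          f"saved {saved:5.1f} ns ({100 * saved / base:4.1f}%)")
```

The quoted AVG lines match these means truncated to whole nanoseconds, and the
quoted savings were computed from those truncated values, so the unrounded
percentages differ slightly in the last digit.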

Well, I only said (or maybe I forgot to say) that on my machines I got no
improvement at all with either the multiple ALUs or the prefetch. (I tried
different strides.)

Only noise in the results.

It seems to depend on the CPU and/or multiple other factors.

The last machine I used for the tests had:

processor : 23
vendor_id : GenuineIntel
cpu family : 6
model : 44
model name : Intel(R) Xeon(R) CPU X5660 @ 2.80GHz
stepping : 2
microcode : 0x13
cpu MHz : 2800.256
cache size : 12288 KB
physical id : 1
siblings : 12
core id : 10
cpu cores : 6


