Re: [PATCH] x86: Run checksumming in parallel across multiple ALUs

From: Neil Horman
Date: Sat Oct 26 2013 - 09:58:53 EST


On Sat, Oct 26, 2013 at 02:01:08PM +0200, Ingo Molnar wrote:
>
> * Neil Horman <nhorman@xxxxxxxxxxxxx> wrote:
>
> > On Mon, Oct 21, 2013 at 12:44:05PM -0700, Eric Dumazet wrote:
> > > On Mon, 2013-10-21 at 15:21 -0400, Neil Horman wrote:
> > >
> > > >
> > > > Ok, so I ran the above code on a single CPU using taskset, and set IRQ affinity
> > > > such that no interrupts (save for local ones) would occur on that CPU. Note
> > > > that I had to convert csum_partial_opt to csum_partial, as the _opt variant
> > > > doesn't exist in my tree, nor do I see it in any upstream tree or in the history
> > > > anywhere.
> > >
> > > This csum_partial_opt() was a private implementation of csum_partial()
> > > so that I could load the module without rebooting the kernel ;)
> > >
> > > >
> > > > base results:
> > > > 53569916
> > > > 43506025
> > > > 43476542
> > > > 44048436
> > > > 45048042
> > > > 48550429
> > > > 53925556
> > > > 53927374
> > > > 53489708
> > > > 53003915
> > > >
> > > > AVG = 492 ns
> > > >
> > > > prefetching only:
> > > > 53279213
> > > > 45518140
> > > > 49585388
> > > > 53176179
> > > > 44071822
> > > > 43588822
> > > > 44086546
> > > > 47507065
> > > > 53646812
> > > > 54469118
> > > >
> > > > AVG = 488 ns
> > > >
> > > >
> > > > parallel alu's only:
> > > > 46226844
> > > > 44458101
> > > > 46803498
> > > > 45060002
> > > > 46187624
> > > > 37542946
> > > > 45632866
> > > > 46275249
> > > > 45031141
> > > > 46281204
> > > >
> > > > AVG = 449 ns
> > > >
> > > >
> > > > both optimizations:
> > > > 45708837
> > > > 45631124
> > > > 45697135
> > > > 45647011
> > > > 45036679
> > > > 39418544
> > > > 44481577
> > > > 46820868
> > > > 44496471
> > > > 35523928
> > > >
> > > > AVG = 438 ns
> > > >
> > > >
> > > > We continue to see a small savings in execution time with prefetching (4 ns, or
> > > > about 0.8%), a better savings with parallel ALU execution (43 ns, or 8.7%), and
> > > > the best savings with both optimizations (54 ns, or 10.9%).
> > > >
> > > > These results, while they've changed as we've modified the test case slightly,
> > > > have remained consistent in their speedup ordering: prefetching helps, but
> > > > not as much as using multiple ALUs, and neither is as good as doing both
> > > > together.
> > > >
> > > > Unless you see something else that I'm doing wrong here, it seems like a win to
> > > > do both.
> > > >
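For reference, a minimal C sketch of what the two optimizations under
discussion amount to (illustrative only, not the actual patch): two
independent accumulators give the CPU two dependency chains, so the adds
can issue on separate ALUs, while a software prefetch pulls upcoming cache
lines in ahead of use:

/*
 * Illustrative sketch only -- not the real csum_partial().
 * Assumes len is a multiple of 8 and the buffer is 4-byte aligned;
 * tail handling and endianness details are omitted.
 */
static unsigned int csum_sketch(const unsigned char *buf, int len)
{
	const unsigned int *p = (const unsigned int *)buf;
	unsigned long sum1 = 0, sum2 = 0;

	while (len >= 8) {
		__builtin_prefetch(p + 16);	/* ~64 bytes ahead */
		sum1 += p[0];			/* two independent chains,  */
		sum2 += p[1];			/* one per ALU		    */
		p += 2;
		len -= 8;
	}

	sum1 += sum2;
	/* fold the 64-bit accumulator down to a 32-bit partial sum */
	sum1 = (sum1 & 0xffffffff) + (sum1 >> 32);
	sum1 = (sum1 & 0xffffffff) + (sum1 >> 32);
	return (unsigned int)sum1;
}

(The function name and the prefetch distance above are made up for
illustration; the real prefetch stride would need tuning, as the
comments below about trying different strides suggest.)
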
> > >
> > > Well, I only said (or maybe I forgot) that on my machines I got no
> > > improvements at all with the multiple ALUs or the prefetch (I tried
> > > different strides).
> > >
> > > Only noise in the results.
> > >
> > I thought you previously said that running netperf gave you a statistically
> > significant performance boost when you added prefetching:
> > http://marc.info/?l=linux-kernel&m=138178914124863&w=2
> >
> > But perhaps I missed a note somewhere.
> >
> > > It seems it depends on the CPU and/or multiple factors.
> > >
> > > Last machine I used for the tests had :
> > >
> > > processor : 23
> > > vendor_id : GenuineIntel
> > > cpu family : 6
> > > model : 44
> > > model name : Intel(R) Xeon(R) CPU X5660 @ 2.80GHz
> > > stepping : 2
> > > microcode : 0x13
> > > cpu MHz : 2800.256
> > > cache size : 12288 KB
> > > physical id : 1
> > > siblings : 12
> > > core id : 10
> > > cpu cores : 6
> > >
> > >
> > >
> > >
> >
> > That's about what I'm running with:
> > processor : 0
> > vendor_id : GenuineIntel
> > cpu family : 6
> > model : 44
> > model name : Intel(R) Xeon(R) CPU E5620 @ 2.40GHz
> > stepping : 2
> > microcode : 0x13
> > cpu MHz : 1600.000
> > cache size : 12288 KB
> > physical id : 0
> > siblings : 8
> > core id : 0
> > cpu cores : 4
> >
> >
> > I can't imagine what would cause the discrepancy in our results (a
> > 10% savings in execution time seems significant to me). My only
> > thought would be that possibly the ALUs on your CPU are faster
> > than mine, reducing the speedup obtained by performing operations
> > in parallel, though I can't imagine that's the case with these
> > processors being so closely matched.
>
> You keep ignoring my request to calculate and account for noise of
> the measurement.
>
Don't confuse "ignoring" with "haven't gotten there yet"; sometimes we all have
to wait, Ingo. I'm working on it now, but I hit a snag on the machine I'm
working with and am still trying to sort it out.

> For example you are talking about a 0.8% prefetch effect while the
> noise in the results is obviously much larger than that, with a
> min/max distance of around 20%:
>
> > > > 43476542
> > > > 53927374
>
> so the noise of 10 measurements would be around 10% or more. (back of the
> envelope calculation)
>
> So you might be right in the end, but the posted data does not
> support your claims, statistically.
>
> It's your responsibility to come up with convincing measurements and
> results, not of those who review your work.
>
Be patient, I'm getting there
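
For what it's worth, the noise accounting boils down to something like the
following standalone sketch (illustrative only; it just recomputes the mean,
standard deviation and min/max spread over the ten "base" samples posted
above -- build with -lm):

#include <math.h>
#include <stdio.h>

int main(void)
{
	/* the ten "base" samples posted above */
	static const double ns[] = {
		53569916, 43506025, 43476542, 44048436, 45048042,
		48550429, 53925556, 53927374, 53489708, 53003915,
	};
	const int n = sizeof(ns) / sizeof(ns[0]);
	double sum = 0.0, var = 0.0, min = ns[0], max = ns[0];
	double mean, stddev;
	int i;

	for (i = 0; i < n; i++) {
		sum += ns[i];
		if (ns[i] < min)
			min = ns[i];
		if (ns[i] > max)
			max = ns[i];
	}
	mean = sum / n;

	for (i = 0; i < n; i++)
		var += (ns[i] - mean) * (ns[i] - mean);
	stddev = sqrt(var / (n - 1));	/* sample standard deviation */

	printf("mean %.0f  stddev %.0f (%.1f%%)  min/max spread %.1f%%\n",
	       mean, stddev, 100.0 * stddev / mean,
	       100.0 * (max - min) / mean);
	return 0;
}

A difference between configurations is then only worth claiming if it
clearly exceeds that spread.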

Thanks
Neil
