Re: [PATCH v2] x86/lib: Optimize 8x loop and memory clobbers in csum_partial.c

From: Noah Goldstein
Date: Sat Nov 27 2021 - 01:42:38 EST


On Sat, Nov 27, 2021 at 12:03 AM Eric Dumazet <edumazet@xxxxxxxxxx> wrote:
>
> On Fri, Nov 26, 2021 at 8:25 PM Noah Goldstein <goldstein.w.n@xxxxxxxxx> wrote:
> >
> > Modify the 8x loop so that it uses two independent
> > accumulators. Despite adding more instructions, the latency and
> > throughput of the loop are improved because the `adc` chains can now
> > take advantage of multiple execution units.
> >
> > Make the memory clobbers more precise. 'buff' is read-only and we know
> > the exact range that is accessed. There is no reason to write-clobber
> > all memory.
> >
> > Relative performance changes on Tigerlake:
> >
> > Time Unit: Ref Cycles
> > Size Unit: Bytes
> >
> > size, lat old, lat new, tput old, tput new
> > 0, 4.961, 4.901, 4.887, 4.951
> > 8, 5.590, 5.620, 4.227, 4.252
> > 16, 6.182, 6.202, 4.233, 4.278
> > 24, 7.392, 7.380, 4.256, 4.279
> > 32, 7.371, 7.390, 4.550, 4.537
> > 40, 8.621, 8.601, 4.862, 4.836
> > 48, 9.406, 9.374, 5.206, 5.234
> > 56, 10.535, 10.522, 5.416, 5.447
> > 64, 10.000, 7.590, 6.946, 6.989
> > 100, 14.218, 12.476, 9.429, 9.441
> > 200, 22.115, 16.937, 13.088, 12.852
> > 300, 31.826, 24.640, 19.383, 18.230
> > 400, 39.016, 28.133, 23.223, 21.304
> > 500, 48.815, 36.186, 30.331, 27.104
> > 600, 56.732, 40.120, 35.899, 30.363
> > 700, 66.623, 48.178, 43.044, 36.400
> > 800, 73.259, 51.171, 48.564, 39.173
> > 900, 82.821, 56.635, 58.592, 45.162
> > 1000, 90.780, 63.703, 65.658, 48.718
> >
> > Signed-off-by: Noah Goldstein <goldstein.w.n@xxxxxxxxx>
> >
> > tmp
>
> SGTM (not sure what this 'tmp' string means here :) )
>
> Reviewed-by: Eric Dumazet <edumazet@xxxxxxxxxx>

Poor rebasing practices :/

Fixed in V3 (only change).
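
For anyone reading along, the two-accumulator idea from the commit
message can be sketched in portable C. This is only an illustration and
not the code from the patch: the real loop is x86 inline assembly built
on `adc`, and the helper names below (add64_carry, csum_blocks_one_acc,
csum_blocks_two_acc) are hypothetical. The sketch assumes the length is
a multiple of 64 bytes and leaves out the tail handling and the final
fold to 16 bits.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* 64-bit add with end-around carry (ones' complement addition). */
static inline uint64_t add64_carry(uint64_t a, uint64_t b)
{
	uint64_t s = a + b;

	return s + (s < b);	/* s < b means the add wrapped */
}

/* Reference: one accumulator, so all eight adds form one dependency chain. */
static uint64_t csum_blocks_one_acc(const unsigned char *buff, size_t len)
{
	uint64_t acc = 0;

	while (len >= 64) {
		uint64_t w[8];

		memcpy(w, buff, sizeof(w));	/* alignment-safe load */
		for (int i = 0; i < 8; i++)
			acc = add64_carry(acc, w[i]);
		buff += 64;
		len -= 64;
	}
	return acc;
}

/*
 * Two accumulators: words 0..3 feed acc0 and words 4..7 feed acc1.
 * The two chains do not depend on each other, so an out-of-order
 * core can run them in parallel. They are merged once at the end.
 */
static uint64_t csum_blocks_two_acc(const unsigned char *buff, size_t len)
{
	uint64_t acc0 = 0, acc1 = 0;

	while (len >= 64) {
		uint64_t w[8];

		memcpy(w, buff, sizeof(w));
		for (int i = 0; i < 4; i++) {
			acc0 = add64_carry(acc0, w[i]);
			acc1 = add64_carry(acc1, w[i + 4]);
		}
		buff += 64;
		len -= 64;
	}
	return add64_carry(acc0, acc1);
}
```

Because addition with end-around carry is commutative and associative,
splitting the words across two chains gives the same 64-bit sum as the
single chain. Only the length of the dependency chain changes, which is
consistent with the latency column improving from 64 bytes upward in
the table.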