RE: [PATCH v2] LoongArch: add checksum optimization for 64-bit system

From: David Laight
Date: Fri Feb 10 2023 - 06:08:23 EST


From: maobibo
> Sent: 10 February 2023 10:06
>
> With the test cases
> https://github.com/bibo-mao/bench/tree/master/csum
>
> Tested with different buffer size 4096/1472/250/40, here is the output on my
> loongarch machine. Loops times is 0x100000, and time cost unit is milliseconds,
> and the smaller value will be better.
>
>
> buf size[4096] loops[0x100000] times[us]: csum uint128 344473 asm method 373391 uint64 741412
> buf size[1472] loops[0x100000] times[us]: csum uint128 131849 asm method 138533 uint64 271317
> buf size[ 250] loops[0x100000] times[us]: csum uint128 34512 asm method 36294 uint64 51576
> buf size[ 40] loops[0x100000] times[us]: csum uint128 12182 asm method 23874 uint64 15769

What do those work out as in bytes/clock?

Rather than run 1000s of iterations (and be hit by interrupts etc)
I sometimes just use an accurate cycle counter and measure the
time for a single buffer (or varying length).
Save and print the values of 10 calls and you'll get pretty
consistent values after the first couple (cold cache).
Then you can work out how long each iteration of the main loop costs.

I think you have to execute 4 instructions for each 64bit word.
One memory read, the main add, a setle and the add of the carry.

For a simple cpu that is always going to be 4 clocks.
But if there are delay slots after the memory read you can
fill them with the alu instructions for an earlier read.
You also need an add and bne for the address for each iteration.
So unrolling the loop further will help.

OTOH if your cpu can execute multiple instructions in one clock
you can expect to do a lot better.
With 3 ALU instructions (and one read) you should be able to
find a code sequence that will run at 8 bytes/clock.
With 4 ALU it is likely that the loop instructions can also
execute in parallel - so you don't need massive loop unrolling.

Unless the cpu is massively 'out of order' (like some x86)
I'd expect the best code to interleave the reads and alu
operations for earlier values - rather than having all
the reads at the top of the loop.
So the loop would be a repeating pattern of instructions
with some values being carried between iterations.

I doubt you'll get a loop to execute every clock, but
a two clock loop is entirely possible.
It rather depends how fast the instruction decoder
handles the (predicted) branch.

David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)