Re: [PATCH v2] LoongArch: add checksum optimization for 64-bit system

From: maobibo
Date: Fri Feb 10 2023 - 08:30:17 EST




On 2023/2/10 19:08, David Laight wrote:
> From: maobibo
> Sent: 10 February 2023 10:06

>> With the test cases
>> https://github.com/bibo-mao/bench/tree/master/csum
>>
>> Tested with different buffer sizes 4096/1472/250/40; here is the output on my
>> LoongArch machine. The loop count is 0x100000, the time unit is microseconds,
>> and smaller values are better.
>>
>> buf size[4096] loops[0x100000] times[us]: csum uint128 344473 asm method 373391 uint64 741412
>> buf size[1472] loops[0x100000] times[us]: csum uint128 131849 asm method 138533 uint64 271317
>> buf size[ 250] loops[0x100000] times[us]: csum uint128  34512 asm method  36294 uint64  51576
>> buf size[  40] loops[0x100000] times[us]: csum uint128  12182 asm method  23874 uint64  15769

> What do those work out as in bytes/clock?
>
> Rather than run 1000s of iterations (and be hit by interrupts etc)
> I sometimes just use an accurate cycle counter and measure the
> time for a single buffer (or varying length).
> Save and print the values of 10 calls and you'll get pretty
> consistent values after the first couple (cold cache).
> Then you can work out how long each iteration of the main loop costs.

CPU freq is 2 GHz; from the result for buffer size 4096, it works out to about 5.8 bytes/clock.
The constant timestamp counter on LoongArch runs at 100 MHz, and there is no CPU cycle count register on the platform.

> I think you have to execute 4 instructions for each 64bit word.
> One memory read, the main add, a setle and the add of the carry.
>
> For a simple cpu that is always going to be 4 clocks.
> But if there are delay slots after the memory read you can
> fill them with the alu instructions for an earlier read.
> You also need an add and bne for the address for each iteration.
> So unrolling the loop further will help.
There is no delay slot on the LoongArch platform; and yes, for an 8-byte csum
calculation at least 4 clocks are needed.
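The four-instruction per-word pattern under discussion (load, add, carry test, add of the carry) can be sketched in portable C. This is only an illustration of the arithmetic, not the kernel's actual csum code, and `csum_words` is a hypothetical name; a compiler would typically lower the comparison to a single sltu-style instruction:

```c
#include <stdint.h>
#include <stddef.h>

/* End-around-carry sum of 64-bit words: one load, one add,
 * one carry test, one add of the carry per word. */
static uint64_t csum_words(const uint64_t *p, size_t nwords)
{
    uint64_t sum = 0;

    for (size_t i = 0; i < nwords; i++) {
        uint64_t v = p[i];   /* memory read                     */
        sum += v;            /* main add (may wrap)             */
        sum += (sum < v);    /* sum < v iff the add overflowed  */
    }
    return sum;
}
```

The unsigned-wraparound trick is the key step: after `sum += v`, the result is smaller than `v` exactly when the addition carried out of bit 63.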


> OTOH if your cpu can execute multiple instructions in one clock
> you can expect to do a lot better.
> With 3 ALU instructions (and one read) you should be able to
> find a code sequence that will run at 8 bytes/clock.
> With 4 ALU it is likely that the loop instructions can also
> execute in parallel - so you don't need massive loop unrolling.
>
> Unless the cpu is massively 'out of order' (like some x86)
> I'd expect the best code to interleave the reads and alu
> operations for earlier values - rather than having all
> the reads at the top of the loop.
I do not think so. As in the memcpy asm function, the memory accesses are
grouped together; a memory access stalls only if the data is not in the L1
cache the first time. The cache line is loaded into the CPU read buffer, and
the subsequent reads within that cache line cost 1 cycle each.
However, I will try the method of interleaving the reads and ALU operations.
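As one possible shape for that experiment, an unrolled variant with independent end-around-carry accumulator chains lets the adds for earlier loads issue while later loads are still in flight. This is a hypothetical sketch (`csum_words_unrolled` is not the patch's code); real gains depend on the core's issue width:

```c
#include <stdint.h>
#include <stddef.h>

/* Four independent carry chains, folded together at the end. */
static uint64_t csum_words_unrolled(const uint64_t *p, size_t nwords)
{
    uint64_t s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    size_t i = 0;

    for (; i + 4 <= nwords; i += 4) {
        uint64_t v0 = p[i], v1 = p[i + 1], v2 = p[i + 2], v3 = p[i + 3];
        s0 += v0; s0 += (s0 < v0);   /* each chain is independent,  */
        s1 += v1; s1 += (s1 < v1);   /* so these adds can dual- or  */
        s2 += v2; s2 += (s2 < v2);   /* quad-issue on a wide core   */
        s3 += v3; s3 += (s3 < v3);
    }
    for (; i < nwords; i++) {        /* tail words */
        uint64_t v = p[i];
        s0 += v; s0 += (s0 < v);
    }
    /* fold the four chains back together, end-around */
    s0 += s1; s0 += (s0 < s1);
    s0 += s2; s0 += (s0 < s2);
    s0 += s3; s0 += (s0 < s3);
    return s0;
}
```

Because end-around-carry addition is associative and commutative, splitting the sum across chains and folding at the end gives the same result as a single sequential chain.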

> So the loop would be a repeating pattern of instructions
> with some values being carried between iterations.
>
> I doubt you'll get a loop to execute every clock, but
> a two clock loop is entirely possible.
> It rather depends how fast the instruction decoder
> handles the (predicted) branch.
Yes, branch prediction is always expensive and hard to control :(

Regards
Bibo, Mao

> 	David
>
> -
> Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
> Registration No: 1397386 (Wales)