Re: [PATCH v2] LoongArch: add checksum optimization for 64-bit system

From: maobibo
Date: Mon Feb 13 2023 - 20:31:16 EST




On 2023/2/10 19:08, David Laight wrote:
> From: maobibo
>> Sent: 10 February 2023 10:06
>>
>> With the test cases
>> https://github.com/bibo-mao/bench/tree/master/csum
>>
>> Tested with different buffer sizes 4096/1472/250/40; here is the output on my
>> loongarch machine. The loop count is 0x100000, the time cost unit is microseconds,
>> and smaller values are better.
>>
>>
>> buf size[4096] loops[0x100000] times[us]: csum uint128 344473 asm method 373391 uint64 741412
>> buf size[1472] loops[0x100000] times[us]: csum uint128 131849 asm method 138533 uint64 271317
>> buf size[ 250] loops[0x100000] times[us]: csum uint128 34512 asm method 36294 uint64 51576
>> buf size[ 40] loops[0x100000] times[us]: csum uint128 12182 asm method 23874 uint64 15769
>
> What do those work out as in bytes/clock?
>
> Rather than run 1000s of iterations (and be hit by interrupts etc)
> I sometimes just use an accurate cycle counter and measure the
> time for a single buffer (or varying length).
> Save and print the values of 10 calls and you'll get pretty
> consistent values after the first couple (cold cache).
> Then you can work out how long each iteration of the main loop costs.
>
> I think you have to execute 4 instructions for each 64bit word.
> One memory read, the main add, a setle and the add of the carry.
>
> For a simple cpu that is always going to be 4 clocks.
> But if there are delay slots after the memory read you can
> fill them with the alu instructions for an earlier read.
> You also need an add and bne for the address for each iteration.
> So unrolling the loop further will help.
>
> OTOH if your cpu can execute multiple instructions in one clock
> you can expect to do a lot better.
> With 3 ALU instructions (and one read) you should be able to
> find a code sequence that will run at 8 bytes/clock.
> With 4 ALU it is likely that the loop instructions can also
> execute in parallel - so you don't need massive loop unrolling.
>
> Unless the cpu is massively 'out of order' (like some x86)
> I'd expect the best code to interleave the reads and alu
> operations for earlier values - rather than having all
> the reads at the top of the loop.
> So the loop would be a repeating pattern of instructions
> with some values being carried between iterations.
Part of the asm code at
https://github.com/loongson/linux/commit/92a6df48ccb73dd2c3dc1799add08adf0e0b0deb
depends on the previous instruction, such as the macro ADDC:
#define ADDC(sum,reg) \
ADD sum, sum, reg; \
sltu t8, sum, reg; \
ADD sum, sum, t8; \
these three instructions depend on each other and cannot execute
in parallel.
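
The same add-with-carry step written as a plain C sketch (illustrative
only, not the kernel code) makes the dependency obvious: each statement
needs the result of the previous one.

static inline unsigned long addc(unsigned long sum, unsigned long v)
{
	sum += v;		/* ADD  sum, sum, reg */
	sum += (sum < v);	/* sltu t8, sum, reg; ADD sum, sum, t8 */
	return sum;
}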

The original main loop at .Lmove_128bytes is:
#define CSUM_BIGCHUNK(src, offset, sum, _t0, _t1, _t2, _t3) \
LOAD _t0, src, (offset + UNIT(0)); \
LOAD _t1, src, (offset + UNIT(1)); \
LOAD _t2, src, (offset + UNIT(2)); \
LOAD _t3, src, (offset + UNIT(3)); \
ADDC(_t0, _t1); \
ADDC(_t2, _t3); \
ADDC(sum, _t0); \
ADDC(sum, _t2)

.Lmove_128bytes:
CSUM_BIGCHUNK(src, 0x00, sum, t0, t1, t3, t4)
CSUM_BIGCHUNK(src, 0x20, sum, t0, t1, t3, t4)
CSUM_BIGCHUNK(src, 0x40, sum, t0, t1, t3, t4)
CSUM_BIGCHUNK(src, 0x60, sum, t0, t1, t3, t4)
addi.d t5, t5, -1
addi.d src, src, 0x80
bnez t5, .Lmove_128bytes
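
Modeled in C with the addc() sketch above (again only a sketch, not the
kernel source), one CSUM_BIGCHUNK reduces a 32-byte group like this; the
last two addc() calls chain through sum, so the four chunks of an
iteration serialize on the sum register:

static unsigned long csum_bigchunk(const unsigned long *p, unsigned long sum)
{
	unsigned long t0 = addc(p[0], p[1]);	/* independent of the next line */
	unsigned long t2 = addc(p[2], p[3]);

	sum = addc(sum, t0);	/* needs the sum from the previous chunk */
	sum = addc(sum, t2);	/* needs the line above */
	return sum;
}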

I modified the main loop at .Lmove_128bytes as follows to reduce the
dependencies between instructions, which improves the performance:
.Lmove_128bytes:
LOAD t0, src, 0
LOAD t1, src, 8
LOAD t3, src, 16
LOAD t4, src, 24
LOAD a3, src, 0 + 0x20
LOAD a4, src, 8 + 0x20
LOAD a5, src, 16 + 0x20
LOAD a6, src, 24 + 0x20
ADD t0, t0, t1
ADD t3, t3, t4
ADD a3, a3, a4
ADD a5, a5, a6
sltu t8, t0, t1
sltu a7, t3, t4
ADD t0, t0, t8
ADD t3, t3, a7
sltu t1, a3, a4
sltu t4, a5, a6
ADD a3, a3, t1
ADD a5, a5, t4
ADD t0, t0, t3
ADD a3, a3, a5
sltu t1, t0, t3
sltu t4, a3, a5
ADD t0, t0, t1
ADD a3, a3, t4
ADD sum, sum, t0
sltu t8, sum, t0
ADD sum, sum, t8
ADD sum, sum, a3
sltu t8, sum, a3
addi.d t5, t5, -1
ADD sum, sum, t8
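
In the same C terms (a sketch of the idea only, using addc() from
above; the comments give the matching registers), every 64-byte group
now keeps two independent chains that only meet at the final additions
into sum:

static unsigned long csum_group64(const unsigned long *p, unsigned long sum)
{
	unsigned long t0 = addc(p[0], p[1]);	/* t0, t1 */
	unsigned long t3 = addc(p[2], p[3]);	/* t3, t4 */
	unsigned long a3 = addc(p[4], p[5]);	/* a3, a4 */
	unsigned long a5 = addc(p[6], p[7]);	/* a5, a6 */

	t0 = addc(t0, t3);	/* merge the t chain */
	a3 = addc(a3, a5);	/* merge the a chain, independently */

	sum = addc(sum, t0);
	sum = addc(sum, a3);
	return sum;
}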

However, the result and the principle are almost the same as with the
uint128 C code, and interleaving the reads and the ALU operations has
no performance impact.
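
For reference, the uint128 method is basically the following idea (a
simplified sketch, not the exact code from the benchmark repo above):
accumulate the 64-bit words into an unsigned __int128 so that the
carries pile up in the upper half for free, and fold them back in at
the end.

static unsigned long csum_u128(const unsigned long *p, unsigned long words)
{
	unsigned __int128 acc = 0;
	unsigned long i;

	for (i = 0; i < words; i++)
		acc += p[i];	/* carries collect in the upper 64 bits */

	/* end-around-carry fold back down to 64 bits */
	while (acc >> 64)
		acc = (unsigned long)acc + (acc >> 64);

	return (unsigned long)acc;
}

Folding that 64-bit value further down to the 16-bit checksum works as
usual.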

Regards
Bibo, Mao

>
> I doubt you'll get a loop to execute every clock, but
> a two clock loop is entirely possible.
> It rather depends how fast the instruction decoder
> handles the (predicted) branch.
>
> David
>