Re: [PATCH v2] LoongArch: add checksum optimization for 64-bit system

From: maobibo
Date: Fri Feb 10 2023 - 05:06:36 EST


With the test cases
https://github.com/bibo-mao/bench/tree/master/csum

Tested with different buffer size 4096/1472/250/40, here is the output on my
loongarch machine. Loops times is 0x100000, and time cost unit is milliseconds,
and the smaller value will be better.


buf size[4096] loops[0x100000] times[us]: csum uint128 344473 asm method 373391 uint64 741412
buf size[1472] loops[0x100000] times[us]: csum uint128 131849 asm method 138533 uint64 271317
buf size[ 250] loops[0x100000] times[us]: csum uint128 34512 asm method 36294 uint64 51576
buf size[ 40] loops[0x100000] times[us]: csum uint128 12182 asm method 23874 uint64 15769


Regards
Bibo, Mao

在 2023/2/10 11:21, Huacai Chen 写道:
> This commit comes from the old internal kernel, I want to know which
> one has better performance.
>
> https://github.com/loongson/linux/commit/92a6df48ccb73dd2c3dc1799add08adf0e0b0deb
>
> On Thu, Feb 9, 2023 at 8:39 PM David Laight <David.Laight@xxxxxxxxxx> wrote:
>>
>> From: maobibo
>>> Sent: 09 February 2023 11:55
>>>
>>>
>>> 在 2023/2/9 17:35, David Laight 写道:
>>>> From: Bibo Mao
>>>>> Sent: 09 February 2023 03:59
>>>>>
>>>>> loongArch platform is 64-bit system, which supports 8 bytes memory
>>>>> accessing, generic checksum function uses 4 byte memory access.
>>>>> This patch adds 8-bytes memory access optimization for checksum
>>>>> function on loongArch. And the code comes from arm64 system.
>>>>
>>>> How fast do these functions actually run (in bytes/clock)?
>>> With uint128 method, there will unrolled loop, instruction
>>> can execute in parallel. It gets the best result on loongarch
>>> system where there is no neither carry flag nor post-index
>>> addressing modes.
>>
>> We're probably almost agreeing...
>>
>>> Here is the piece of disassemble code with uint128 method:
>>
>> Load 8 values:
>>
>>> 120000a40: 28c0222f ld.d $r15,$r17,8(0x8)
>>> 120000a44: 28c0622a ld.d $r10,$r17,24(0x18)
>>> 120000a48: 28c0a230 ld.d $r16,$r17,40(0x28)
>>> 120000a4c: 28c0e232 ld.d $r18,$r17,56(0x38)
>>> 120000a50: 28c0022e ld.d $r14,$r17,0
>>> 120000a54: 28c0422d ld.d $r13,$r17,16(0x10)
>>> 120000a58: 28c0822b ld.d $r11,$r17,32(0x20)
>>> 120000a5c: 28c0c22c ld.d $r12,$r17,48(0x30)
>>
>> Pairwise add them
>>
>>> 120000a60: 0010b9f7 add.d $r23,$r15,$r14
>>> 120000a64: 0010b54d add.d $r13,$r10,$r13
>>> 120000a68: 0010b24c add.d $r12,$r18,$r12
>>> 120000a6c: 0010ae0b add.d $r11,$r16,$r11
>>
>> Generate 4 'carry' bits
>>
>>> 120000a70: 0012c992 sltu $r18,$r12,$r18
>>> 120000a74: 0012beee sltu $r14,$r23,$r15
>>> 120000a78: 0012c170 sltu $r16,$r11,$r16
>>> 120000a7c: 0012a9aa sltu $r10,$r13,$r10
>>
>> Add the carry bits onto the sums.
>> I've not quite worked out which add is which!
>> But I think you've missed a few adds here.
>>
>>> 120000a80: 0010ae0f add.d $r15,$r16,$r11
>>> 120000a84: 0010ddce add.d $r14,$r14,$r23
>>> 120000a88: 0010b250 add.d $r16,$r18,$r12
>>> 120000a8c: 0010b54d add.d $r13,$r10,$r13
>>> 120000a90: 0010b5d2 add.d $r18,$r14,$r13
>>> 120000a94: 0010c1f0 add.d $r16,$r15,$r16
>>
>> Somewhere each value needs an add, an sltu to generate the 'carry',
>> and an add for the carry itself.
>> If you sum the carry bits into a separate register it is
>> possible to get a both adds and the sltu (for different values)
>> to run in the same clock (on a suitable cpu).
>> If there are 4 integer units you can also get the loop instructions
>> 'for free' and unrolling 8 times may not be needed at all.
>>
>> ...
>>> There is no post-index addressing modes on loongarch,
>>> val = *mem; // 64bit read
>>> mem++;
>>> sum += val;
>>> carry = sum < val;
>>> carry_sum += carry;
>>> it takes 5 instruction and these 5 instructions depends on previous instr.
>>
>> I'd assume the loop was unrolled enough so the address
>> increment doesn't matter.
>>
>>> There is the piece of disassemble code:
>>> 120000d90: 28c001f0 ld.d $r16,$r15,0
>>> 120000d94: 0010c58c add.d $r12,$r12,$r17
>>> 120000d98: 02c021ef addi.d $r15,$r15,8(0x8)
>>
>> Those three instructions are independent.
>>
>>> 120000d9c: 0010b20c add.d $r12,$r16,$r12
>>
>> that one depends on the ld.d
>>
>>> 120000da0: 0012c191 sltu $r17,$r12,$r16
>>
>> that depends on the add.d
>> but it could be execute after the 'bne' in parallel with the ld.d
>>
>>> 120000da4: 5fffedf2 bne $r15,$r18,-20(0x3ffec) # 120000d90 <do_csum_64+0x90>
>>
>> If you tweak the code it is possible to get down to just
>> the addi.d and bne constraining the dependency chain.
>> (Assuming there is no delay on the read and there are an infinite
>> number of execution units.)
>> Unroll once and do:
>> ld.d r,addr,0
>> addi.d addr,16
>> ld.d r,addr,-8
>> bne addr,limit,loop_top
>> and you might get a loop that does a memory read every clock.
>>
>> So you end up worrying about how the memory read delays affect
>> the instruction pipeline.
>> The Intel x86 cpu I've got just pile up the arithmetic instructions
>> waiting for the data to be read.
>> If you get a memory read requested every clock everything else
>> follows - provided you don't try to execute too many instrcutions
>> at once.
>>
>> David
>>
>> -
>> Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
>> Registration No: 1397386 (Wales)