Re: [PATCH v2] LoongArch: add checksum optimization for 64-bit system

From: maobibo
Date: Thu Feb 09 2023 - 07:05:44 EST




On 2023/2/9 17:35, David Laight wrote:
> From: Bibo Mao
>> Sent: 09 February 2023 03:59
>>
>> LoongArch is a 64-bit platform that supports 8-byte memory
>> accesses, whereas the generic checksum function uses 4-byte accesses.
>> This patch adds an 8-byte memory access optimization for the checksum
>> function on LoongArch. The code is derived from the arm64 implementation.
>
> How fast do these functions actually run (in bytes/clock)?
With the uint128 method the loop is unrolled, so the instructions
can execute in parallel. It gives the best result on LoongArch,
which has neither a carry flag nor post-index addressing modes.

Here is a piece of the disassembled code for the uint128 method:
120000a40: 28c0222f ld.d $r15,$r17,8(0x8)
120000a44: 28c0622a ld.d $r10,$r17,24(0x18)
120000a48: 28c0a230 ld.d $r16,$r17,40(0x28)
120000a4c: 28c0e232 ld.d $r18,$r17,56(0x38)
120000a50: 28c0022e ld.d $r14,$r17,0
120000a54: 28c0422d ld.d $r13,$r17,16(0x10)
120000a58: 28c0822b ld.d $r11,$r17,32(0x20)
120000a5c: 28c0c22c ld.d $r12,$r17,48(0x30)
120000a60: 0010b9f7 add.d $r23,$r15,$r14
120000a64: 0010b54d add.d $r13,$r10,$r13
120000a68: 0010b24c add.d $r12,$r18,$r12
120000a6c: 0010ae0b add.d $r11,$r16,$r11
120000a70: 0012c992 sltu $r18,$r12,$r18
120000a74: 0012beee sltu $r14,$r23,$r15
120000a78: 0012c170 sltu $r16,$r11,$r16
120000a7c: 0012a9aa sltu $r10,$r13,$r10
120000a80: 0010ae0f add.d $r15,$r16,$r11
120000a84: 0010ddce add.d $r14,$r14,$r23
120000a88: 0010b250 add.d $r16,$r18,$r12
120000a8c: 0010b54d add.d $r13,$r10,$r13
120000a90: 0010b5d2 add.d $r18,$r14,$r13
120000a94: 0010c1f0 add.d $r16,$r15,$r16
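
For reference, the add.d/sltu/add.d triples above are what a 128-bit end-around-carry add compiles to. A minimal C sketch of that idea follows. It is not the patch itself: accumulate() mirrors the helper of the same name in the arm64 code as far as I recall it, while csum_blocks_u128() and fold64() are hypothetical names used only for illustration, and `unsigned __int128` is assumed to be available (GCC/Clang extension).

```c
#include <stddef.h>
#include <stdint.h>

/* Add two 64-bit values with end-around carry using a 128-bit temporary.
 * tmp holds sum:data; adding the half-swapped value puts
 * sum + data + carry in the upper 64 bits. */
static uint64_t accumulate(uint64_t sum, uint64_t data)
{
	unsigned __int128 tmp = (unsigned __int128)sum << 64 | data;

	return (uint64_t)((tmp + (tmp >> 64 | tmp << 64)) >> 64);
}

/* Hypothetical helper: sum nblocks blocks of 64 bytes (8 words each).
 * The four pairwise adds are independent, so they can issue in parallel. */
static uint64_t csum_blocks_u128(const uint64_t *p, size_t nblocks, uint64_t sum)
{
	while (nblocks--) {
		uint64_t a = accumulate(p[0], p[1]);
		uint64_t b = accumulate(p[2], p[3]);
		uint64_t c = accumulate(p[4], p[5]);
		uint64_t d = accumulate(p[6], p[7]);

		a = accumulate(a, b);
		c = accumulate(c, d);
		sum = accumulate(sum, accumulate(a, c));
		p += 8;
	}
	return sum;
}

/* Hypothetical helper: fold a 64-bit ones' complement sum to 16 bits. */
static uint16_t fold64(uint64_t sum)
{
	sum = (sum & 0xffffffff) + (sum >> 32);
	sum = (sum & 0xffffffff) + (sum >> 32);
	sum = (sum & 0xffff) + (sum >> 16);
	sum = (sum & 0xffff) + (sum >> 16);
	return (uint16_t)sum;
}
```

The sketch ignores alignment, tail bytes and byte order, which a real do_csum() has to handle.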

>
> It is quite possible that just adding 32bit values to a
> 64bit register is faster.
> Any non-trivial cpu will run that at 4 bytes/clock
> (for suitably unrolled and pipelined code).
> On a more complex cpu adding to two registers will
> give 8 bytes/clock (needs two memory loads/clock).
>
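
The 32-bit variant described above can be sketched in C roughly as follows. This is only an illustration of the idea, not code from the patch or from the generic checksum; csum_words32() is a hypothetical name, and the two accumulators stand in for the "adding to two registers" case.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: sum 32-bit words into two 64-bit accumulators.
 * No per-add carry handling is needed, because each 64-bit accumulator
 * cannot overflow for fewer than 2^32 words. */
static uint32_t csum_words32(const uint32_t *mem, size_t nwords)
{
	uint64_t sum0 = 0, sum1 = 0, sum;
	size_t i;

	for (i = 0; i + 1 < nwords; i += 2) {
		sum0 += mem[i];
		sum1 += mem[i + 1];
	}
	if (i < nwords)
		sum0 += mem[i];

	/* Combine both accumulators, then fold the carries back in. */
	sum = (sum0 & 0xffffffff) + (sum0 >> 32) +
	      (sum1 & 0xffffffff) + (sum1 >> 32);
	sum = (sum & 0xffffffff) + (sum >> 32);
	sum = (sum & 0xffffffff) + (sum >> 32);
	return (uint32_t)sum;
}
```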
> The fastest 64bit sum you'll get on anything mips-like
> (no carry flag) is probably from something like:
> val = *mem++; // 64bit read
> sum += val;
> carry = sum < val;
> carry_sum += carry;
> which is 2 bytes/instruction again.
> To get to 8 bytes/clock you need to execute all 4 instructions
> every clock - so 1 read and 3 arithmetic.
There are no post-index addressing modes on LoongArch, so the loop becomes:
val = *mem; // 64bit read
mem++;
sum += val;
carry = sum < val;
carry_sum += carry;
That takes 5 instructions, and these 5 instructions depend on the previous ones.
Here is the corresponding disassembly:
120000d90: 28c001f0 ld.d $r16,$r15,0
120000d94: 0010c58c add.d $r12,$r12,$r17
120000d98: 02c021ef addi.d $r15,$r15,8(0x8)
120000d9c: 0010b20c add.d $r12,$r16,$r12
120000da0: 0012c191 sltu $r17,$r12,$r16
120000da4: 5fffedf2 bne $r15,$r18,-20(0x3ffec) # 120000d90 <do_csum_64+0x90>
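
For comparison, the carry-by-compare loop can be written out as compilable C. This is a sketch following the pseudocode quoted above, with carries collected in a separate carry_sum; the disassembly instead adds the previous carry straight into the sum. csum_words_simple() is a hypothetical name, not a function from the patch.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: 64-bit ones' complement sum without a carry flag.
 * A carry out of "sum += val" is detected by checking sum < val. */
static uint64_t csum_words_simple(const uint64_t *mem, size_t nwords)
{
	uint64_t sum = 0, carry_sum = 0;
	size_t i;

	for (i = 0; i < nwords; i++) {
		uint64_t val = mem[i];	/* 64-bit read */

		sum += val;
		carry_sum += sum < val;	/* 1 if the add wrapped */
	}

	/* Add the collected carries back in, with end-around carry. */
	sum += carry_sum;
	if (sum < carry_sum)
		sum++;
	return sum;
}
```

Every iteration feeds sum into the next one, which is the dependency chain that limits this form.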

regards
bibo, mao


> (c/f 2 read and 2 arithmetic for 32bit adds.)
>
> Arm has a carry flag so the code is:
> val = *mem++;
> temp,carry = sum + val;
> sum = sum + val + carry;
> There are still two dependent arithmetic instructions for
> each 8-byte word.
> The dependencies on the flags register also make it harder
> to get any benefit from interleaving adds to two registers.
>
> x86-64 uses 64bit 'add with carry' chains.
> No one ever noticed that they take two clocks each on
> Intel cpu until (about) Haswell.
> It is possible to get 12 bytes/clock with some strange
> loops that use (IIRC) adox and adcx.
>
> David
>