RE: [PATCH v2] LoongArch: add checksum optimization for 64-bit system

From: David Laight
Date: Thu Feb 09 2023 - 07:39:18 EST


From: maobibo
> Sent: 09 February 2023 11:55
>
>
> 在 2023/2/9 17:35, David Laight 写道:
> > From: Bibo Mao
> >> Sent: 09 February 2023 03:59
> >>
> >> loongArch platform is 64-bit system, which supports 8 bytes memory
> >> accessing, generic checksum function uses 4 byte memory access.
> >> This patch adds 8-bytes memory access optimization for checksum
> >> function on loongArch. And the code comes from arm64 system.
> >
> > How fast do these functions actually run (in bytes/clock)?
> With uint128 method, there will unrolled loop, instruction
> can execute in parallel. It gets the best result on loongarch
> system where there is no neither carry flag nor post-index
> addressing modes.

We're probably almost agreeing...

> Here is the piece of disassemble code with uint128 method:

Load 8 values:

> 120000a40: 28c0222f ld.d $r15,$r17,8(0x8)
> 120000a44: 28c0622a ld.d $r10,$r17,24(0x18)
> 120000a48: 28c0a230 ld.d $r16,$r17,40(0x28)
> 120000a4c: 28c0e232 ld.d $r18,$r17,56(0x38)
> 120000a50: 28c0022e ld.d $r14,$r17,0
> 120000a54: 28c0422d ld.d $r13,$r17,16(0x10)
> 120000a58: 28c0822b ld.d $r11,$r17,32(0x20)
> 120000a5c: 28c0c22c ld.d $r12,$r17,48(0x30)

Pairwise add them

> 120000a60: 0010b9f7 add.d $r23,$r15,$r14
> 120000a64: 0010b54d add.d $r13,$r10,$r13
> 120000a68: 0010b24c add.d $r12,$r18,$r12
> 120000a6c: 0010ae0b add.d $r11,$r16,$r11

Generate 4 'carry' bits

> 120000a70: 0012c992 sltu $r18,$r12,$r18
> 120000a74: 0012beee sltu $r14,$r23,$r15
> 120000a78: 0012c170 sltu $r16,$r11,$r16
> 120000a7c: 0012a9aa sltu $r10,$r13,$r10

Add the carry bits onto the sums.
I've not quite worked out which add is which!
But I think you've missed a few adds here.

> 120000a80: 0010ae0f add.d $r15,$r16,$r11
> 120000a84: 0010ddce add.d $r14,$r14,$r23
> 120000a88: 0010b250 add.d $r16,$r18,$r12
> 120000a8c: 0010b54d add.d $r13,$r10,$r13
> 120000a90: 0010b5d2 add.d $r18,$r14,$r13
> 120000a94: 0010c1f0 add.d $r16,$r15,$r16

Somewhere each value needs an add, an sltu to generate the 'carry',
and an add for the carry itself.
If you sum the carry bits into a separate register it is
possible to get a both adds and the sltu (for different values)
to run in the same clock (on a suitable cpu).
If there are 4 integer units you can also get the loop instructions
'for free' and unrolling 8 times may not be needed at all.

...
> There is no post-index addressing modes on loongarch,
> val = *mem; // 64bit read
> mem++;
> sum += val;
> carry = sum < val;
> carry_sum += carry;
> it takes 5 instruction and these 5 instructions depends on previous instr.

I'd assume the loop was unrolled enough so the address
increment doesn't matter.

> There is the piece of disassemble code:
> 120000d90: 28c001f0 ld.d $r16,$r15,0
> 120000d94: 0010c58c add.d $r12,$r12,$r17
> 120000d98: 02c021ef addi.d $r15,$r15,8(0x8)

Those three instructions are independent.

> 120000d9c: 0010b20c add.d $r12,$r16,$r12

that one depends on the ld.d

> 120000da0: 0012c191 sltu $r17,$r12,$r16

that depends on the add.d
but it could be execute after the 'bne' in parallel with the ld.d

> 120000da4: 5fffedf2 bne $r15,$r18,-20(0x3ffec) # 120000d90 <do_csum_64+0x90>

If you tweak the code it is possible to get down to just
the addi.d and bne constraining the dependency chain.
(Assuming there is no delay on the read and there are an infinite
number of execution units.)
Unroll once and do:
ld.d r,addr,0
addi.d addr,16
ld.d r,addr,-8
bne addr,limit,loop_top
and you might get a loop that does a memory read every clock.

So you end up worrying about how the memory read delays affect
the instruction pipeline.
The Intel x86 cpu I've got just pile up the arithmetic instructions
waiting for the data to be read.
If you get a memory read requested every clock everything else
follows - provided you don't try to execute too many instrcutions
at once.

David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)