Re: [PATCH v2] LoongArch: add checksum optimization for 64-bit system

From: Huacai Chen
Date: Thu Feb 09 2023 - 22:21:54 EST

Next message: TingHan Shen (沈廷翰): "Re: [PATCH v4 10/12] remoteproc: mediatek: Handle MT8195 SCP core 1 watchdog timeout"
Previous message: kernel test robot: "drivers/remoteproc/ti_k3_dsp_remoteproc.c:432:17: sparse: sparse: cast removes address space '__iomem' of expression"
In reply to: David Laight: "RE: [PATCH v2] LoongArch: add checksum optimization for 64-bit system"
Next in thread: maobibo: "Re: [PATCH v2] LoongArch: add checksum optimization for 64-bit system"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

This commit comes from the old internal kernel, I want to know which
one has better performance.

https://github.com/loongson/linux/commit/92a6df48ccb73dd2c3dc1799add08adf0e0b0deb

On Thu, Feb 9, 2023 at 8:39 PM David Laight <David.Laight@xxxxxxxxxx> wrote:
>
> From: maobibo
> > Sent: 09 February 2023 11:55
> >
> >
> > 在 2023/2/9 17:35, David Laight 写道:
> > > From: Bibo Mao
> > >> Sent: 09 February 2023 03:59
> > >>
> > >> loongArch platform is 64-bit system, which supports 8 bytes memory
> > >> accessing, generic checksum function uses 4 byte memory access.
> > >> This patch adds 8-bytes memory access optimization for checksum
> > >> function on loongArch. And the code comes from arm64 system.
> > >
> > > How fast do these functions actually run (in bytes/clock)?
> > With uint128 method, there will unrolled loop, instruction
> > can execute in parallel. It gets the best result on loongarch
> > system where there is no neither carry flag nor post-index
> > addressing modes.
>
> We're probably almost agreeing...
>
> > Here is the piece of disassemble code with uint128 method:
>
> Load 8 values:
>
> > 120000a40: 28c0222f ld.d $r15,$r17,8(0x8)
> > 120000a44: 28c0622a ld.d $r10,$r17,24(0x18)
> > 120000a48: 28c0a230 ld.d $r16,$r17,40(0x28)
> > 120000a4c: 28c0e232 ld.d $r18,$r17,56(0x38)
> > 120000a50: 28c0022e ld.d $r14,$r17,0
> > 120000a54: 28c0422d ld.d $r13,$r17,16(0x10)
> > 120000a58: 28c0822b ld.d $r11,$r17,32(0x20)
> > 120000a5c: 28c0c22c ld.d $r12,$r17,48(0x30)
>
> Pairwise add them
>
> > 120000a60: 0010b9f7 add.d $r23,$r15,$r14
> > 120000a64: 0010b54d add.d $r13,$r10,$r13
> > 120000a68: 0010b24c add.d $r12,$r18,$r12
> > 120000a6c: 0010ae0b add.d $r11,$r16,$r11
>
> Generate 4 'carry' bits
>
> > 120000a70: 0012c992 sltu $r18,$r12,$r18
> > 120000a74: 0012beee sltu $r14,$r23,$r15
> > 120000a78: 0012c170 sltu $r16,$r11,$r16
> > 120000a7c: 0012a9aa sltu $r10,$r13,$r10
>
> Add the carry bits onto the sums.
> I've not quite worked out which add is which!
> But I think you've missed a few adds here.
>
> > 120000a80: 0010ae0f add.d $r15,$r16,$r11
> > 120000a84: 0010ddce add.d $r14,$r14,$r23
> > 120000a88: 0010b250 add.d $r16,$r18,$r12
> > 120000a8c: 0010b54d add.d $r13,$r10,$r13
> > 120000a90: 0010b5d2 add.d $r18,$r14,$r13
> > 120000a94: 0010c1f0 add.d $r16,$r15,$r16
>
> Somewhere each value needs an add, an sltu to generate the 'carry',
> and an add for the carry itself.
> If you sum the carry bits into a separate register it is
> possible to get a both adds and the sltu (for different values)
> to run in the same clock (on a suitable cpu).
> If there are 4 integer units you can also get the loop instructions
> 'for free' and unrolling 8 times may not be needed at all.
>
> ...
> > There is no post-index addressing modes on loongarch,
> > val = *mem; // 64bit read
> > mem++;
> > sum += val;
> > carry = sum < val;
> > carry_sum += carry;
> > it takes 5 instruction and these 5 instructions depends on previous instr.
>
> I'd assume the loop was unrolled enough so the address
> increment doesn't matter.
>
> > There is the piece of disassemble code:
> > 120000d90: 28c001f0 ld.d $r16,$r15,0
> > 120000d94: 0010c58c add.d $r12,$r12,$r17
> > 120000d98: 02c021ef addi.d $r15,$r15,8(0x8)
>
> Those three instructions are independent.
>
> > 120000d9c: 0010b20c add.d $r12,$r16,$r12
>
> that one depends on the ld.d
>
> > 120000da0: 0012c191 sltu $r17,$r12,$r16
>
> that depends on the add.d
> but it could be execute after the 'bne' in parallel with the ld.d
>
> > 120000da4: 5fffedf2 bne $r15,$r18,-20(0x3ffec) # 120000d90 <do_csum_64+0x90>
>
> If you tweak the code it is possible to get down to just
> the addi.d and bne constraining the dependency chain.
> (Assuming there is no delay on the read and there are an infinite
> number of execution units.)
> Unroll once and do:
> ld.d r,addr,0
> addi.d addr,16
> ld.d r,addr,-8
> bne addr,limit,loop_top
> and you might get a loop that does a memory read every clock.
>
> So you end up worrying about how the memory read delays affect
> the instruction pipeline.
> The Intel x86 cpu I've got just pile up the arithmetic instructions
> waiting for the data to be read.
> If you get a memory read requested every clock everything else
> follows - provided you don't try to execute too many instrcutions
> at once.
>
> David
>
> -
> Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
> Registration No: 1397386 (Wales)

Next message: TingHan Shen (沈廷翰): "Re: [PATCH v4 10/12] remoteproc: mediatek: Handle MT8195 SCP core 1 watchdog timeout"
Previous message: kernel test robot: "drivers/remoteproc/ti_k3_dsp_remoteproc.c:432:17: sparse: sparse: cast removes address space '__iomem' of expression"
In reply to: David Laight: "RE: [PATCH v2] LoongArch: add checksum optimization for 64-bit system"
Next in thread: maobibo: "Re: [PATCH v2] LoongArch: add checksum optimization for 64-bit system"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]