Re: [PATCH] LoongArch: add checksum optimization for 64-bit system

From: WANG Xuerui
Date: Wed Feb 08 2023 - 08:48:49 EST


On 2023/2/8 21:12, David Laight wrote:
> From: Bibo Mao
> Sent: 07 February 2023 04:02
>
> > The LoongArch platform is a 64-bit system which supports 8-byte memory
> > accesses, while the generic checksum code only uses 4-byte accesses.
> > This patch adds an 8-byte memory access optimization for the checksum
> > functions on LoongArch; the code is taken from arm64.
> >
> > When hardware checksumming on the NIC is disabled, iperf throughput
> > improves by about 10% with this patch.

> ...
> > +static inline __sum16 csum_fold(__wsum csum)
> > +{
> > +	u32 sum = (__force u32)csum;
> > +
> > +	sum += (sum >> 16) | (sum << 16);
> > +	return ~(__force __sum16)(sum >> 16);
> > +}

> Does LoongArch have a rotate instruction?
> But for everything except arm (which has a rotate+add instruction)
> the best is (probably):
> 	(~sum - rol32(sum, 16)) >> 16
>
> To the point where it is worth killing all the asm
> versions and just using that one.

Yeah, LoongArch can do rotates, and your suggestion does indeed shave one insn off every invocation of csum_fold.
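
Spelled out as a full function, the suggestion would look something like the sketch below (just an illustration, not the actual patch; it assumes the usual rol32() helper from <linux/bitops.h> and the checksum types from <asm/checksum.h>):

/* Sketch only: csum_fold() written with the suggested expression. */
static inline __sum16 csum_fold(__wsum csum)
{
	u32 sum = (__force u32)csum;

	/*
	 * ~sum - rol32(sum, 16) == ~(sum + rol32(sum, 16)), so the top
	 * half already holds the folded 16-bit sum with the inversion
	 * applied; the shift just extracts it.
	 */
	return (__force __sum16)((~sum - rol32(sum, 16)) >> 16);
}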

From this:

000000000000096c <csum_fold>:
sum += (sum >> 16) | (sum << 16);
96c: 004cc08c rotri.w $t0, $a0, 0x10
970: 00101184 add.w $a0, $t0, $a0
return ~(__force __sum16)(sum >> 16);
974: 0044c084 srli.w $a0, $a0, 0x10
978: 00141004 nor $a0, $zero, $a0
}
97c: 006f8084 bstrpick.w $a0, $a0, 0xf, 0x0
980: 4c000020 jirl $zero, $ra, 0

To:

0000000000000984 <csum_fold2>:
return (~sum - rol32(sum, 16)) >> 16;
984: 0014100c nor $t0, $zero, $a0
return (x << amt) | (x >> (32 - amt));
988: 004cc084 rotri.w $a0, $a0, 0x10
return (~sum - rol32(sum, 16)) >> 16;
98c: 00111184 sub.w $a0, $t0, $a0
}
990: 00df4084 bstrpick.d $a0, $a0, 0x1f, 0x10
994: 4c000020 jirl $zero, $ra, 0
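
For what it's worth, the two folds can be brute-force checked to agree for every 32-bit input with a throwaway userspace loop like this (my quick sketch, not part of the patch):

/* Throwaway check that the old and the suggested fold give the same result. */
#include <stdint.h>
#include <stdio.h>

static uint32_t rol32(uint32_t x, unsigned int amt)
{
	return (x << amt) | (x >> (32 - amt));
}

static uint16_t fold_old(uint32_t sum)
{
	sum += (sum >> 16) | (sum << 16);
	return (uint16_t)~(sum >> 16);
}

static uint16_t fold_new(uint32_t sum)
{
	return (uint16_t)((~sum - rol32(sum, 16)) >> 16);
}

int main(void)
{
	uint32_t x = 0;

	do {
		if (fold_old(x) != fold_new(x)) {
			printf("mismatch at %#x\n", (unsigned int)x);
			return 1;
		}
	} while (++x != 0);	/* wraps back to 0 after 0xffffffff */

	puts("both folds agree for all 32-bit inputs");
	return 0;
}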

I guess Bibo could take this suggestion and check the other arches afterwards, okay? ;-)
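
(For completeness: the 8-byte-access part of the original patch boils down to accumulating the buffer in 64-bit chunks with end-around carry, as in arm64's do_csum(). Below is a purely illustrative userspace sketch of that idea, not the actual patch, which also unrolls the loop and handles unaligned heads and tails:)

#include <stdint.h>
#include <stddef.h>
#include <string.h>

/* Illustration only: one's-complement sum over 64-bit words (RFC 1071). */
static uint32_t csum_64bit_sketch(const void *buf, size_t len)
{
	const uint8_t *p = buf;
	uint64_t sum = 0;

	while (len >= 8) {
		uint64_t v;

		memcpy(&v, p, 8);	/* one 8-byte load per iteration */
		sum += v;
		sum += (sum < v);	/* end-around carry */
		p += 8;
		len -= 8;
	}
	/* tail handling omitted for brevity */

	/* fold 64 bits down to 32, again propagating the carries */
	sum = (sum & 0xffffffff) + (sum >> 32);
	sum = (sum & 0xffffffff) + (sum >> 32);
	return (uint32_t)sum;
}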

--
WANG "xen0n" Xuerui

Linux/LoongArch mailing list: https://lore.kernel.org/loongarch/