RE: [PATCH v2] LoongArch: add checksum optimization for 64-bit system

From: David Laight
Date: Thu Feb 09 2023 - 04:35:42 EST


From: Bibo Mao
> Sent: 09 February 2023 03:59
>
> loongArch platform is 64-bit system, which supports 8 bytes memory
> accessing, generic checksum function uses 4 byte memory access.
> This patch adds 8-bytes memory access optimization for checksum
> function on loongArch. And the code comes from arm64 system.

How fast do these functions actually run (in bytes/clock)?

It is quite possible that just adding 32bit values to a
64bit register is faster.
Any non-trivial cpu will run that at 4 bytes/clock
(for suitably unrolled and pipelined code).
On a more complex cpu adding to two registers will
give 8 bytes/clock (needs two memory loads/clock).

The fastest 64bit sum you'll get on anything mips-like
(no carry flag) is probably from something like:
val = *mem++; // 64bit read
sum += val;
carry = sum < val;
carry_sum += carry;
which is 2 bytes/instruction again.
To get to 8 bytes/clock you need to execute all 4 instructions
every clock - so 1 read and 3 arithmetic.
(c/f 2 read and 2 arithmetic for 32bit adds.)

Arm has a carry flag so the code is:
val = *mem++;
temp,carry = sum + val;
sum = sum + val + carry;
There are still two dependant arithmetic instructions for
each 8-byte word.
The dependencies on the flags register also make it harder
to get any benefit from interleaving adds to two registers.

x86-64 uses 64bit 'add with carry' chains.
No one ever noticed that they take two clocks each on
Intel cpu until (about) Haswell.
It is possible to get 12 bytes/clock with some strange
loops that use (IIRC) adxo and adxc.

David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)