RE: [PATCH v2] LoongArch: add checksum optimization for 64-bit system

From: David Laight
Date: Thu Feb 16 2023 - 04:03:16 EST


From: maobibo
> Sent: 14 February 2023 14:19
...
> Got it. It makes use of pipeline better, rather than number of ALUs for
> different micro-architectures. I will try this method, thanks again for
> kindly help and explanation with patience.

It is also worth pointing out that if the cpu does 'out of order'
execution it may be just as good to simply repeat blocks of:
    load v0, addr, 0*8
    add  sum0, v0
    sltu v0, sum0, v0
    add  carry0, v0

That assumes the prefetch/decode logic can predict the loop
and generate enough decoded instructions for all the alu units.

The add/sltu/add will be queued until the load completes
and then execute in the next three clocks.
The load for the next block will be scheduled as soon as
the load/store unit has finished processing the previous load.
So all the alu instructions just wait for the required input
to be available and a memory load executes every clock.
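
In C, one such block per word might look roughly like the sketch
below. The function name csum_words and the final end-around fold
are assumptions for illustration only, not taken from the patch;
the comments map each statement back to the pseudo-instructions.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: sums 64-bit words, counting carries separately. */
static uint64_t csum_words(const uint64_t *buf, size_t nwords)
{
    uint64_t sum0 = 0, carry0 = 0;

    for (size_t i = 0; i < nwords; i++) {
        uint64_t v0 = buf[i];   /* load v0, addr, i*8 */
        sum0 += v0;             /* add  sum0, v0 */
        v0 = sum0 < v0;         /* sltu v0, sum0, v0 (1 if the add wrapped) */
        carry0 += v0;           /* add  carry0, v0 */
    }

    /* Fold the saved carries back in with an end-around carry. */
    sum0 += carry0;
    sum0 += sum0 < carry0;
    return sum0;
}
```

Only the load and the add into sum0 sit on the loop-carried path;
the sltu and the carry0 add hang off to the side, which is what
lets an out-of-order cpu overlap successive blocks.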

Multiple sum0 and carry0 registers aren't actually needed.
But having 2 of each (even if the loop is unrolled 4 times)
might help a bit.
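
A sketch of that variant in C, unrolled 4 times with two sum/carry
pairs. The names csum_words_x4 and add_fold are hypothetical, and
whether the second pair actually helps depends on the cpu and the
compiler, so treat this as something to measure rather than a
recommendation.

```c
#include <stddef.h>
#include <stdint.h>

/* End-around-carry add of two 64-bit partial sums. */
static uint64_t add_fold(uint64_t a, uint64_t b)
{
    a += b;
    return a + (a < b);
}

/* Hypothetical helper: 4x unrolled, alternating two accumulator pairs. */
static uint64_t csum_words_x4(const uint64_t *buf, size_t nwords)
{
    uint64_t sum0 = 0, carry0 = 0, sum1 = 0, carry1 = 0;
    uint64_t v0;
    size_t i = 0;

    for (; i + 4 <= nwords; i += 4) {
        v0 = buf[i];     sum0 += v0; carry0 += sum0 < v0;
        v0 = buf[i + 1]; sum1 += v0; carry1 += sum1 < v0;
        v0 = buf[i + 2]; sum0 += v0; carry0 += sum0 < v0;
        v0 = buf[i + 3]; sum1 += v0; carry1 += sum1 < v0;
    }
    /* Leftover words (fewer than 4) go through the first pair. */
    for (; i < nwords; i++) {
        v0 = buf[i]; sum0 += v0; carry0 += sum0 < v0;
    }

    /* Combine both pairs, folding carries end-around. */
    return add_fold(add_fold(sum0, carry0), add_fold(sum1, carry1));
}
```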

If the cpu does 'register renaming' (as most x86 do) you
can use the same register name for 'v0' in all the blocks
(even though it is live with several different values at once).

But a simpler in-order multi-issue cpu will need you to
correctly interleave the instructions for maximum throughput.
Interleaving also does no harm on a very simple cpu that has
delays before a loaded value can be used.

David
