RE: [PATCH] x86: Run checksumming in parallel across multiple ALUs

From: David Laight
Date: Wed Oct 30 2013 - 10:06:11 EST

> and then I also wanted to try using both xmm and ymm registers and doing
> 64bit adds with 32bit numbers across multiple xmm/ymm registers as that
> should parallel nicely. David, you mentioned you've tried this, how did
> your experiment turn out and what was your method? I was planning on
> doing regular full size loads into one xmm/ymm register, then using
> pshufd/vshufd to move the data into two different registers, then
> summing into a fourth register, and possibly running two of those
> pipelines in parallel.

It was a long time ago, and IIRC the code was just SSE so the
register length just wasn't going to give the required benefit.
I know I wrote the code, but I can't even remember whether I
actually got it working!
With the longer AVX words it might make enough difference.
Of course, this assumes that you have the fpu registers
available. If you have to do an fpu context switch it will
be a lot slower.

About the same time I did manage to get an open-coded copy
loop to run as fast as 'rep movs' - and without any unrolling
or any prefetch instructions.

Thinking about AVX you should be able to do (without looking up the
actual mnemonics):
	add 32bit chunks to sum
	compare sum with read value (equiv of carry)
	add/subtract compare result (0 or ~0) to a carry-sum register
That is 4 instructions for 256 bits, so you can aim for 4 clocks.
You'd need to check the cpu book to see if any of those can
be scheduled at the same time (if not dependent).
(and also whether there is any result delay - don't think so.)
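
As a rough illustration of the add/compare/carry-sum steps above, here is
a minimal sketch in portable scalar C rather than AVX code: an array of
eight 32-bit lanes stands in for one 256-bit register. The name
lane_sum32 and the restriction to lengths that are a multiple of 32 bytes
are assumptions made for the example, not anything from the patch. A real
vector version would also need an unsigned 32-bit compare, which is not
modelled here.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define LANES 8 /* eight 32-bit lanes model one 256-bit register */

/*
 * Hypothetical helper: returns the 64-bit total of all 32-bit words in buf.
 * Only whole 32-byte blocks are processed (assumption for this sketch).
 */
uint64_t lane_sum32(const unsigned char *buf, size_t len)
{
    uint32_t sum[LANES] = {0};   /* per-lane 32-bit running sums */
    uint32_t carry[LANES] = {0}; /* per-lane carry-sum "register" */
    uint64_t total = 0;

    for (size_t off = 0; off + 4 * LANES <= len; off += 4 * LANES) {
        uint32_t data[LANES];

        memcpy(data, buf + off, sizeof data); /* the 256-bit read */
        for (int i = 0; i < LANES; i++) {
            /* add 32-bit chunk to sum (wraps modulo 2^32) */
            sum[i] += data[i];
            /* compare sum with read value: ~0 if the add wrapped, else 0 */
            uint32_t mask = (sum[i] < data[i]) ? ~0u : 0u;
            /* subtracting ~0 is the same as adding 1 to the carry count */
            carry[i] -= mask;
        }
    }

    /* combine each lane's carry count (high half) and sum (low half) */
    for (int i = 0; i < LANES; i++)
        total += ((uint64_t)carry[i] << 32) | sum[i];
    return total;
}
```

The final reduction to a 16-bit checksum is left out; the sketch only
shows how the compare result recovers the carries that a lane-wise add
throws away.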

I'd try running two copies of the above - probably skewed so that
the memory accesses are separated, do the memory read for the
next iteration, and use the 3rd instruction unit for loop control.
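
A very rough C sketch of that loop shape, under the assumption that two
independent accumulators stand in for the "two copies" and that the read
for the next iteration is issued before the current adds. The names
load32 and two_chain_sum32 are made up for the example, and a compiler is
free to reorder all of this, so the actual skewing and the use of a
particular instruction unit would have to be done in assembler.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: alignment-safe 32-bit load. */
static inline uint32_t load32(const unsigned char *p)
{
    uint32_t v;

    memcpy(&v, p, sizeof v);
    return v;
}

/*
 * Hypothetical helper: sums the 32-bit words of buf using two independent
 * dependency chains (a and b). Only whole 8-byte pairs are processed
 * (assumption for this sketch).
 */
uint64_t two_chain_sum32(const unsigned char *buf, size_t len)
{
    size_t pairs = len / 8;
    uint64_t a = 0, b = 0;
    uint32_t cur_a, cur_b;

    if (pairs == 0)
        return 0;

    /* prime the pipeline with the first pair of words */
    cur_a = load32(buf);
    cur_b = load32(buf + 4);

    for (size_t i = 1; i < pairs; i++) {
        /* memory read for the next iteration comes first */
        uint32_t next_a = load32(buf + 8 * i);
        uint32_t next_b = load32(buf + 8 * i + 4);

        /* the two adds do not depend on each other */
        a += cur_a;
        b += cur_b;

        cur_a = next_a;
        cur_b = next_b;
    }

    /* drain the last pair that was read */
    a += cur_a;
    b += cur_b;
    return a + b;
}
```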

