I've been looking at my benchmark figures and I think I've found why the
figures for my version were different to yours. It's not your code which
is at fault, it's the way it was hooked into the benchmarking program.
The compiler was inlining some parts which it shouldn't have been
allowed to do, sorry :-/.
With that issue corrected, decompression is the same speed; however,
compression is showing about a 9% performance loss compared to my kernel
patch.
I did some diffs of the assembler output from our two versions (mine
matches minilzo). For decompression the output is effectively identical.
For compression, there are significant differences. If I add a noinline
attribute to lzo1x_compress_worker, that removes a lot of them (and
boosts speed a bit) but there are still differences. Ideally, I'd like
to understand why.
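
For reference, the noinline experiment looks roughly like the sketch
below. This is only an illustration, assuming GCC (or a compiler that
accepts GCC-style attributes): BENCH_NOINLINE, compress_worker_stub and
compress_stub are hypothetical names, and the worker body is a trivial
placeholder rather than the real lzo1x_compress_worker.

```c
#include <stddef.h>

/* Hypothetical macro: expands to GCC's noinline attribute where available. */
#if defined(__GNUC__)
#define BENCH_NOINLINE __attribute__((noinline))
#else
#define BENCH_NOINLINE
#endif

/*
 * Placeholder standing in for lzo1x_compress_worker. It only counts runs
 * of identical bytes; it is NOT the LZO algorithm. The attribute asks the
 * compiler to keep this as a separate function even though it is static
 * and has a single caller.
 */
static BENCH_NOINLINE size_t compress_worker_stub(const unsigned char *in,
                                                  size_t in_len)
{
    size_t runs = 0;

    for (size_t i = 0; i < in_len; i++) {
        /* A new run starts at the first byte or whenever the byte changes. */
        if (i == 0 || in[i] != in[i - 1])
            runs++;
    }
    return runs;
}

/* Wrapper playing the role of the public compress entry point. */
size_t compress_stub(const unsigned char *in, size_t in_len)
{
    return compress_worker_stub(in, in_len);
}
```

Building this with gcc -S, with and without the attribute, and diffing
the two .s files should show whether the worker keeps its own symbol or
gets folded into its caller.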