Re: [PATCH 0/6] lib/lzo: performance improvements

From: Markus F.X.J. Oberhumer
Date: Wed Nov 21 2018 - 08:52:57 EST


Hi Dave,

thanks for your patch set. Just some initial comments:


I think the three patches

[PATCH 2/6] lib/lzo: enable 64-bit CTZ on Arm
[PATCH 3/6] lib/lzo: 64-bit CTZ on Arm aarch64
[PATCH 4/6] lib/lzo: fast 8-byte copy on arm64

should be applied in any case - could you please make an extra
pull request out of these and try to get them merged as fast
as possible. Thanks.


The first patch

[PATCH 1/6] lib/lzo: clean-up by introducing COPY16

does not really affect the resulting code at the moment, but please
note that in one case the actual copy unit is not allowed to be
greater than 8 bytes (which the name "COPY16" might wrongly suggest).
So this needs more work, like an extra COPY16_BY_8() macro.
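Such a macro could look roughly like the sketch below (a minimal illustration, not kernel code; COPY8 is assumed to be an 8-byte copy helper, and the kernel would use get_unaligned/put_unaligned rather than memcpy). The point is that two ordered 8-byte copies stay correct for overlapping match copies with a distance of at least 8, whereas a single 16-byte copy unit would not be.

```c
#include <stdint.h>
#include <string.h>

/* Assumed 8-byte copy helper (illustration only). */
#define COPY8(dst, src) memcpy((dst), (src), 8)

/* COPY16 as a single 16-byte unit: only safe when source and
 * destination do not overlap within 16 bytes. */
#define COPY16(dst, src) memcpy((dst), (src), 16)

/* COPY16_BY_8: copies 16 bytes as two ordered 8-byte units, so it
 * remains correct when the regions overlap with distance >= 8. */
#define COPY16_BY_8(dst, src)                                            \
	do {                                                             \
		COPY8((dst), (src));                                     \
		COPY8((uint8_t *)(dst) + 8, (const uint8_t *)(src) + 8); \
	} while (0)
```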


As for your "lzo-rle" improvements, I'll have a look.

Please note that the first byte value 17 is actually valid when using
external dictionaries ("lzo1x_decompress_dict_safe()" in the LZO source
code). While this functionality is not present in the Linux kernel at
the moment, it could cause problems for future enhancements.
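To illustrate the concern (a hypothetical sketch, not actual kernel code): if the new format were detected purely by the first byte being 17, a classic stream produced for external-dictionary use could be misclassified, since 17 is a valid first byte there as well.

```c
#include <stddef.h>

/* Hypothetical detection check: treat first byte 17 as an lzo-rle
 * marker. Streams decompressed via lzo1x_decompress_dict_safe()-style
 * usage may legitimately start with 17 too, so this test alone cannot
 * distinguish the two formats. */
static int looks_like_lzo_rle(const unsigned char *in, size_t len)
{
	return len >= 2 && in[0] == 17;
}
```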

Finally, I'm wondering if your chart comparison covers just the "lzo-rle"
patch or also includes the ARM64 improvements - I cannot see where a
20% speedup should come from if you have 0% zeros.

Cheers,
Markus



On 2018-11-21 13:06, Dave Rodgman wrote:
> This patch series introduces performance improvements for lzo.
>
> The improvements fall into two categories: general Arm-specific optimisations
> (e.g., more efficient memory access); and the introduction of a special case
> for handling runs of zeros (which is a common case for zram) using run-length
> encoding.
>
> The introduction of RLE modifies the bitstream such that it can't be decoded
> by old versions of lzo (the new lzo-rle can correctly decode old bitstreams).
> To avoid possible issues where data is persisted on disk (e.g., squashfs), the
> final patch in this series separates lzo-rle into a separate algorithm
> alongside lzo, so that the new lzo-rle is (by default) only used for zram and
> must be explicitly selected for other use-cases. This final patch could be
> omitted if the consensus is that we'd rather avoid proliferation of lzo
> variants.
>
> Overall, performance is improved by around 1.1 - 4.8x (data-dependent: data
> with many zero runs shows higher improvement). Under real-world testing with
> zram, time spent in (de)compression during swapping is reduced by around 27%.
> The graph below shows the weighted round-trip throughput of lzo, lz4 and
> lzo-rle, for randomly generated 4k chunks of data with varying levels of
> entropy. (To calculate weighted round-trip throughput, compression performance
> is emphasised to reflect the fact that zram does around 2.25x more compression
> than decompression.) Results and overall trends are fairly similar for
> unweighted.
>
> https://drive.google.com/file/d/18GU4pgRVCLNN7wXxynz-8R2ygrY2IdyE/view
>
> Contributors:
> Dave Rodgman <dave.rodgman@xxxxxxx>
> Matt Sealey <matt.sealey@xxxxxxx>
>

--
Markus Oberhumer, <markus@xxxxxxxxxxxxx>, http://www.oberhumer.com/