Re: [PATCH 1/1] LZ4: Port LZ4 1.9.x FAST_DEC_LOOP and enable it on x86 and ARM64

From: Cyan
Date: Wed May 15 2019 - 13:05:23 EST


Re-posted,
it seems the previous message was rejected by the linux-kernel server
due to some kind of format limitation (no HTML).


On Wed, May 15, 2019 at 09:56, Cyan <yann.collet.73@xxxxxxxxx> wrote:
>
> The v1.9.0 version has a bug which makes it read a few bytes out of bounds in certain cases.
> This was fixed in v1.9.1.
> Therefore, if you plan to upgrade, skip directly to v1.9.1.
>
> Note that, in v1.9.1, LZ4_decompress_fast() is now deprecated.
> LZ4_decompress_fast() is a security liability, since it's unable to cope with malicious inputs.
> While this is not a problem when the producer is under direct control (local production and consumption, no intermediate storage),
> this entry point seems misused in multiple scenarios where input cannot be decently trusted.
> The deprecation warning is meant to "scare away" such usages.
> It's still possible to intentionally remove the warning, for users who know what they are doing.
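>
> (For reference: in upstream lz4.h, a consumer who accepts the risk can
> silence it by defining LZ4_DISABLE_DEPRECATE_WARNINGS before including
> the header; a minimal userspace sketch:
>
>     #define LZ4_DISABLE_DEPRECATE_WARNINGS
>     #include "lz4.h"
>
> The kernel copy of lz4.h carries no such attribute, so this is a
> userspace-only concern.)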
>
> Note that, on top of that, the _fast() variant is now slower than LZ4_decompress_safe(), removing its most important positive differentiator.
> That's because LZ4_decompress_fast() is "blind": it doesn't know the input size.
> It just guesses it, from the requested output size, and by relying on end-of-block conditions.
> This effectively limits it to copying up to 8 bytes at a time,
> which proves slower than the 32 bytes at a time used in LZ4_decompress_safe().
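>
> As a sketch of the two copy strategies (simplified from the real helpers,
> which add FORCE_INLINE and pointer-type plumbing):
>
>     /* _fast() path: 8-byte steps; callers guarantee slack past dstEnd */
>     static void wildCopy8(BYTE *d, const BYTE *s, BYTE *const e)
>     {
>         do { memcpy(d, s, 8); d += 8; s += 8; } while (d < e);
>     }
>
>     /* _safe() fast loop: 32 bytes per round, as two 16-byte copies
>      * so it stays compatible with offsets >= 16 */
>     static void wildCopy32(BYTE *d, const BYTE *s, BYTE *const e)
>     {
>         do {
>             memcpy(d, s, 16); memcpy(d + 16, s + 16, 16);
>             d += 32; s += 32;
>         } while (d < e);
>     }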
>
> LZ4_decompress_fast() still has one possible usage:
> it doesn't need the compressed size.
> Therefore, in a scenario where the decompressed size is known but the compressed size isn't,
> it cannot be replaced by `LZ4_decompress_safe()`.
> This is pretty rare, but can happen in some cases, erofs for example.
> `LZ4_decompress_safe_partial()` *might* be able to replace it for such a scenario,
> but I haven't thoroughly tested it, so can't yet vouch for it.
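>
> For illustration, the shape such a replacement would take; the parameter
> choice is exactly the untested part, and `in_bound` stands for whatever
> upper bound the caller has on the compressed data (e.g. the mapped block size):
>
>     /* out_len (decompressed size) is known; only a bound on input exists */
>     int ret = LZ4_decompress_safe_partial(in, out, in_bound, out_len, out_len);
>     if (ret != out_len)
>         return -1; /* corrupted or truncated input */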
>
> If it doesn't, then it will be necessary to create a new entry point.
> The main difference is that this new entry point will need a bound on the input size, in order to guarantee it never reads beyond the input buffer.
> But as said, for the time being, such a dedicated entry point doesn't exist.
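>
> Purely to illustrate the shape it could take (hypothetical name and
> signature, nothing of the sort exists today):
>
>     /* Decode exactly originalSize bytes into dst, reading at most
>      * maxSrcSize bytes from src. Returns the nb of bytes read from src,
>      * or a negative value on error; never reads beyond src + maxSrcSize. */
>     int LZ4_decompress_fast_bounded(const char *src, char *dst,
>                                     int originalSize, int maxSrcSize);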
>
> The new FAST_DEC_LOOP setting works well on x86 and x64.
> Recently, it has been extended to gcc+arm64 (`dev` branch).
> The combination of clang+arm64 is less successful though (impacts vary, depending on version and exact hardware), so it's still disabled by default.
> But anyone can test it by setting the FAST_DEC_LOOP definition to 1, thus enabling the new decoder loop.
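>
> Since the detection logic is wrapped in `#ifndef LZ4_FAST_DEC_LOOP`, a
> definition supplied ahead of it takes precedence; a minimal sketch,
> against the upstream `dev` branch:
>
>     /* before compiling lz4.c, or via cc -DLZ4_FAST_DEC_LOOP=1 */
>     #define LZ4_FAST_DEC_LOOP 1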
>
>
> Y.
>
> On Tue, May 14, 2019 at 20:28, Mao, Chenxi <Chenxi.Mao@xxxxxxxx> wrote:
>>
>> Hi Xiang:
>>
>> Thanks for your reply, I will run a stress test on my device later.
>> I didn't have a chance to test LZ4 with a clang build because of device limitations. I think I could do it later.
>> I guess the clang performance downgrade is caused by some compiler optimization options.
>> I will double-check it later if I get a device which can build the kernel with clang.
>>
>> BTW, the FAST_DEC_LOOP code leverages LZ4_wildCopy8 instead of LZ4_wildCopy;
>> however, based on my test, the original LZ4_wildCopy has the better performance.
>> The original LZ4_wildCopy API is still invoked by the FAST_DEC_LOOP APIs.
>> That might be a problem for X86 devices with the current patch.
>>
>> Chenxi
>>
>> -----Original Message-----
>> From: Gao Xiang [mailto:gaoxiang25@xxxxxxxxxx]
>> Sent: Wednesday, May 15, 2019 10:21 AM
>> To: Mao, Chenxi <Chenxi.Mao@xxxxxxxx>
>> Cc: akpm@xxxxxxxxxxxxxxxxxxxx; linux-kernel@xxxxxxxxxxxxxxx; Feng, Roy <Roy.Feng@xxxxxxxx>; Xu, Yuanli 2 <Yuanli.Xu@xxxxxxxx>; Alm, Robert 2 <Robert.Alm@xxxxxxxx>; Takahashi, Masaya A (Sony Mobile) <Masaya.A.Takahashi@xxxxxxxx>; Miao Xie <miaoxie@xxxxxxxxxx>
>> Subject: Re: [PATCH 1/1] LZ4: Port LZ4 1.9.x FAST_DEC_LOOP and enable it on x86 and ARM64
>>
>> Hi Chenxi,
>>
>> On 2019/5/15 8:43, Chenxi Mao wrote:
>> > FAST_DEC_LOOP was introduced in LZ4 1.9.
>> > This change brings a ~10% improvement on decompress operations
>> > according to LZ4 benchmark results on X86 devices.
>> > Meanwhile, GCC builds of LZ4 with FAST_DEC_LOOP get improvements as well;
>> > however, the clang compiler shows a downgrade if FAST_DEC_LOOP is enabled.
>>
>> I noticed this optimization, and lz4-1.9.0 [1] was just released a month ago with this big change, therefore I'm a little afraid of its current stability (especially some rarely used decompression APIs...)
>>
>> Could you Cc more people (e.g. lz4 original author Yann Collet) in order to get some more ideas about this if possible?
>>
>> Anyway, I like this optimization as well since FAST_DEC_LOOP improves decompression speed a lot :)
>>
>> [1] https://github.com/lz4/lz4/releases/tag/v1.9.0
>>
>> Thanks,
>> Gao Xiang
>>
>> >
>> > So FAST_DEC_LOOP is only enabled on X86/X86-64, or on ARM64 with GCC builds.
>> >
>> > Here are the test results on ARM64 (Cortex-A53):
>> >
>> > Benchmark via ZRAM:
>> >
>> > Test case:
>> > fio --bs=32k --randrepeat=1 --randseed=100 --refill_buffers \
>> > --buffer_compress_percentage=75 \
>> > --scramble_buffers=1 --direct=1 --loops=100 --numjobs=8 \
>> > --filename=/dev/block/zram0 --name=seq-write --rw=write --stonewall \
>> > --name=seq-read --rw=read --stonewall --name=seq-readwrite --rw=rw \
>> > --stonewall --name=rand-readwrite --rw=randrw --stonewall
>> >
>> > Patched:
>> > READ: bw=7077MiB/s (7421MB/s)
>> > Vanilla:
>> > READ: bw=5134MiB/s (5384MB/s)
>> >
>> > Reference:
>> > 1. https://github.com/lz4/lz4/pull/645
>> > 2. https://github.com/lz4/lz4/pull/707
>> >
>> > Signed-off-by: chenxi.mao <chenxi.mao@xxxxxxxx>
>> > ---
>> > lib/lz4/lz4_decompress.c | 425 +++++++++++++++++++++++++++++++++------
>> > 1 file changed, 361 insertions(+), 64 deletions(-)
>> >
>> > diff --git a/lib/lz4/lz4_decompress.c b/lib/lz4/lz4_decompress.c
>> > index 0c9d3ad17e0f..a4c87e32b3c0 100644
>> > --- a/lib/lz4/lz4_decompress.c
>> > +++ b/lib/lz4/lz4_decompress.c
>> > @@ -50,6 +50,131 @@
>> > #define assert(condition) ((void)0)
>> > #endif
>> >
>> > +#ifndef LZ4_FAST_DEC_LOOP
>> > +#if defined(__i386__) || defined(__x86_64__)
>> > +#define LZ4_FAST_DEC_LOOP 1
>> > +#elif defined(__aarch64__) && !defined(__clang__)
>> > + /* On aarch64, we disable this optimization for clang because on certain
>> > +  * mobile chipsets and clang, it reduces performance. For more information
>> > +  * refer to https://github.com/lz4/lz4/pull/707. */
>> > +#define LZ4_FAST_DEC_LOOP 1
>> > +#else
>> > +#define LZ4_FAST_DEC_LOOP 0
>> > +#endif
>> > +#endif
>> > +
>> > +static const unsigned inc32table[8] = { 0, 1, 2, 1, 0, 4, 4, 4 };
>> > +static const int dec64table[8] = { 0, 0, 0, -1, -4, 1, 2, 3 };
>> > +
>> > +#if LZ4_FAST_DEC_LOOP
>> > +#define FASTLOOP_SAFE_DISTANCE 64
>> > +FORCE_INLINE void
>> > +LZ4_memcpy_using_offset_base(BYTE * dstPtr, const BYTE * srcPtr, BYTE * dstEnd,
>> > + const size_t offset)
>> > +{
>> > + if (offset < 8) {
>> > + dstPtr[0] = srcPtr[0];
>> > +
>> > + dstPtr[1] = srcPtr[1];
>> > + dstPtr[2] = srcPtr[2];
>> > + dstPtr[3] = srcPtr[3];
>> > + srcPtr += inc32table[offset];
>> > + memcpy(dstPtr + 4, srcPtr, 4);
>> > + srcPtr -= dec64table[offset];
>> > + dstPtr += 8;
>> > + } else {
>> > + memcpy(dstPtr, srcPtr, 8);
>> > + dstPtr += 8;
>> > + srcPtr += 8;
>> > + }
>> > +
>> > + LZ4_wildCopy(dstPtr, srcPtr, dstEnd);
>> > +}
>> > +
>> > +/* customized variant of memcpy, which can overwrite up to 32 bytes beyond dstEnd
>> > + * this version copies two times 16 bytes (instead of one time 32 bytes)
>> > + * because it must be compatible with offsets >= 16. */
>> > +FORCE_INLINE void LZ4_wildCopy32(void *dstPtr, const void *srcPtr, void *dstEnd)
>> > +{
>> > + BYTE *d = (BYTE *) dstPtr;
>> > + const BYTE *s = (const BYTE *)srcPtr;
>> > + BYTE *const e = (BYTE *) dstEnd;
>> > +
>> > + do {
>> > + memcpy(d, s, 16);
>> > + memcpy(d + 16, s + 16, 16);
>> > + d += 32;
>> > + s += 32;
>> > + } while (d < e);
>> > +}
>> > +
>> > +FORCE_INLINE void
>> > +LZ4_memcpy_using_offset(BYTE *dstPtr, const BYTE *srcPtr, BYTE *dstEnd,
>> > + const size_t offset)
>> > +{
>> > + BYTE v[8];
>> > + switch (offset) {
>> > +
>> > + case 1:
>> > + memset(v, *srcPtr, 8);
>> > + goto copy_loop;
>> > + case 2:
>> > + memcpy(v, srcPtr, 2);
>> > + memcpy(&v[2], srcPtr, 2);
>> > + memcpy(&v[4], &v[0], 4);
>> > + goto copy_loop;
>> > + case 4:
>> > + memcpy(v, srcPtr, 4);
>> > + memcpy(&v[4], srcPtr, 4);
>> > + goto copy_loop;
>> > + default:
>> > + LZ4_memcpy_using_offset_base(dstPtr, srcPtr, dstEnd, offset);
>> > + return;
>> > + }
>> > +
>> > + copy_loop:
>> > + memcpy(dstPtr, v, 8);
>> > + dstPtr += 8;
>> > + while (dstPtr < dstEnd) {
>> > + memcpy(dstPtr, v, 8);
>> > + dstPtr += 8;
>> > + }
>> > +}
>> > +
>> > +/* Read the variable-length literal or match length.
>> > + *
>> > + * ip - pointer to use as input.
>> > + * lencheck - end ip. Return an error if ip advances >= lencheck.
>> > + * loop_check - check ip >= lencheck in body of loop. Returns loop_error if so.
>> > + * initial_check - check ip >= lencheck before start of loop. Returns initial_error if so.
>> > + * error (output) - error code. Should be set to 0 before call.
>> > + */
>> > +typedef enum { loop_error = -2, initial_error = -1, ok = 0 } variable_length_error;
>> > +FORCE_INLINE unsigned read_variable_length(const BYTE **ip,
>> > + const BYTE *lencheck,
>> > + int loop_check, int initial_check,
>> > + variable_length_error *error)
>> > +{
>> > + unsigned length = 0;
>> > + unsigned s;
>> > + if (initial_check && unlikely((*ip) >= lencheck)) { /* overflow detection */
>> > + *error = initial_error;
>> > + return length;
>> > + }
>> > + do {
>> > + s = **ip;
>> > + (*ip)++;
>> > + length += s;
>> > + if (loop_check && unlikely((*ip) >= lencheck)) { /* overflow detection */
>> > + *error = loop_error;
>> > + return length;
>> > + }
>> > + } while (s == 255);
>> > +
>> > + return length;
>> > +}
>> > +#endif
>> > +
>> > /*
>> > * LZ4_decompress_generic() :
>> > * This generic decompression function covers all use cases.
>> > @@ -80,25 +205,28 @@ static FORCE_INLINE int LZ4_decompress_generic(
>> > const size_t dictSize
>> > )
>> > {
>> > - const BYTE *ip = (const BYTE *) src;
>> > - const BYTE * const iend = ip + srcSize;
>> > + const BYTE *ip = (const BYTE *)src;
>> > + const BYTE *const iend = ip + srcSize;
>> >
>> > BYTE *op = (BYTE *) dst;
>> > - BYTE * const oend = op + outputSize;
>> > + BYTE *const oend = op + outputSize;
>> > BYTE *cpy;
>> >
>> > - const BYTE * const dictEnd = (const BYTE *)dictStart + dictSize;
>> > - static const unsigned int inc32table[8] = {0, 1, 2, 1, 0, 4, 4, 4};
>> > - static const int dec64table[8] = {0, 0, 0, -1, -4, 1, 2, 3};
>> > + const BYTE *const dictEnd = (const BYTE *)dictStart + dictSize;
>> >
>> > const int safeDecode = (endOnInput == endOnInputSize);
>> > const int checkOffset = ((safeDecode) && (dictSize < (int)(64 * KB)));
>> >
>> > /* Set up the "end" pointers for the shortcut. */
>> > const BYTE *const shortiend = iend -
>> > - (endOnInput ? 14 : 8) /*maxLL*/ - 2 /*offset*/;
>> > + (endOnInput ? 14 : 8) /*maxLL*/ - 2 /*offset*/;
>> > const BYTE *const shortoend = oend -
>> > - (endOnInput ? 14 : 8) /*maxLL*/ - 18 /*maxML*/;
>> > + (endOnInput ? 14 : 8) /*maxLL*/ - 18 /*maxML*/;
>> > +
>> > + const BYTE *match;
>> > + size_t offset;
>> > + unsigned int token;
>> > + size_t length;
>> >
>> > DEBUGLOG(5, "%s (srcSize:%i, dstSize:%i)", __func__,
>> > srcSize, outputSize);
>> > @@ -117,15 +245,194 @@ static FORCE_INLINE int LZ4_decompress_generic(
>> > if ((endOnInput) && unlikely(srcSize == 0))
>> > return -1;
>> >
>> > - /* Main Loop : decode sequences */
>> > +#if LZ4_FAST_DEC_LOOP
>> > + if ((oend - op) < FASTLOOP_SAFE_DISTANCE) {
>> > + DEBUGLOG(6, "skip fast decode loop");
>> > + goto safe_decode;
>> > + }
>> > +
>> > + /* Fast loop : decode sequences as long as output < iend-FASTLOOP_SAFE_DISTANCE */
>> > while (1) {
>> > - size_t length;
>> > - const BYTE *match;
>> > - size_t offset;
>> > + /* Main fastloop assertion: We can always wildcopy FASTLOOP_SAFE_DISTANCE */
>> > + assert(oend - op >= FASTLOOP_SAFE_DISTANCE);
>> > + if (endOnInput) {
>> > + assert(ip < iend);
>> > + }
>> > + token = *ip++;
>> > + length = token >> ML_BITS; /* literal length */
>> > +
>> > + assert(!endOnInput || ip <= iend); /* ip < iend before the increment */
>> > +
>> > + /* decode literal length */
>> > + if (length == RUN_MASK) {
>> > + variable_length_error error = ok;
>> > + length +=
>> > + read_variable_length(&ip, iend - RUN_MASK,
>> > + endOnInput, endOnInput,
>> > + &error);
>> > + if (error == initial_error) {
>> > + goto _output_error;
>> > + }
>> > + if ((safeDecode)
>> > + && unlikely((uptrval) (op) + length <
>> > + (uptrval) (op))) {
>> > + goto _output_error;
>> > + } /* overflow detection */
>> > + if ((safeDecode)
>> > + && unlikely((uptrval) (ip) + length <
>> > + (uptrval) (ip))) {
>> > + goto _output_error;
>> > + } /* overflow detection */
>> > +
>> > + /* copy literals */
>> > + cpy = op + length;
>> > + LZ4_STATIC_ASSERT(MFLIMIT >= WILDCOPYLENGTH);
>> > + if (endOnInput) { /* LZ4_decompress_safe() */
>> > + if ((cpy > oend - 32)
>> > + || (ip + length > iend - 32)) {
>> > + goto safe_literal_copy;
>> > + }
>> > + LZ4_wildCopy32(op, ip, cpy);
>> > + } else { /* LZ4_decompress_fast() */
>> > + if (cpy > oend - 8) {
>> > + goto safe_literal_copy;
>> > + }
>> > + LZ4_wildCopy(op, ip, cpy);
>> > + /* LZ4_decompress_fast() cannot copy more than 8 bytes at a time */
>> > + /* it doesn't know input length, and only relies on end-of-block */
>> > + /* properties */
>> > + }
>> > + ip += length;
>> > + op = cpy;
>> > + } else {
>> > + cpy = op + length;
>> > + if (endOnInput) { /* LZ4_decompress_safe() */
>> > + DEBUGLOG(7,
>> > + "copy %u bytes in a 16-bytes stripe",
>> > + (unsigned)length);
>> > + /* We don't need to check oend */
>> > + /* since we check it once for each loop below */
>> > + if (ip > iend - (16 + 1)) { /*max lit + offset + nextToken */
>> > + goto safe_literal_copy;
>> > + }
>> > + /* Literals can only be 14, but hope compilers optimize */
>> > + /*if we copy by a register size */
>> > + memcpy(op, ip, 16);
>> > + } else {
>> > + /* LZ4_decompress_fast() cannot copy more than 8 bytes at a time */
>> > + /* it doesn't know input length, and relies on end-of-block */
>> > + /* properties */
>> > + memcpy(op, ip, 8);
>> > + if (length > 8) {
>> > + memcpy(op + 8, ip + 8, 8);
>> > + }
>> > + }
>> > + ip += length;
>> > + op = cpy;
>> > + }
>> > +
>> > + /* get offset */
>> > + offset = LZ4_readLE16(ip);
>> > + ip += 2; /* end-of-block condition violated */
>> > + match = op - offset;
>> > +
>> > + /* get matchlength */
>> > + length = token & ML_MASK;
>> >
>> > - /* get literal length */
>> > - unsigned int const token = *ip++;
>> > - length = token>>ML_BITS;
>> > + if ((checkOffset) && (unlikely(match + dictSize < lowPrefix))) {
>> > + goto _output_error;
>> > + } /* Error : offset outside buffers */
>> > + if (length == ML_MASK) {
>> > + variable_length_error error = ok;
>> > + length +=
>> > + read_variable_length(&ip, iend - LASTLITERALS + 1,
>> > + endOnInput, 0, &error);
>> > + if (error != ok) {
>> > + goto _output_error;
>> > + }
>> > + if ((safeDecode)
>> > + && unlikely((uptrval) (op) + length < (uptrval) op)) {
>> > + goto _output_error;
>> > + } /* overflow detection */
>> > + length += MINMATCH;
>> > + if (op + length >= oend - FASTLOOP_SAFE_DISTANCE) {
>> > + goto safe_match_copy;
>> > + }
>> > + } else {
>> > + length += MINMATCH;
>> > + if (op + length >= oend - FASTLOOP_SAFE_DISTANCE) {
>> > + goto safe_match_copy;
>> > + }
>> > +
>> > + /* Fastpath check: Avoids a branch in LZ4_wildCopy32 if true */
>> > + if (!(dict == usingExtDict) || (match >= lowPrefix)) {
>> > + if (offset >= 8) {
>> > + memcpy(op, match, 8);
>> > + memcpy(op + 8, match + 8, 8);
>> > + memcpy(op + 16, match + 16, 2);
>> > + op += length;
>> > + continue;
>> > + }
>> > + }
>> > + }
>> > +
>> > + /* match starting within external dictionary */
>> > + if ((dict == usingExtDict) && (match < lowPrefix)) {
>> > + if (unlikely(op + length > oend - LASTLITERALS)) {
>> > + if (partialDecoding) {
>> > + /* reach end of buffer */
>> > + length =
>> > + min(length, (size_t) (oend - op));
>> > + } else {
>> > + /* end-of-block condition violated */
>> > + goto _output_error;
>> > + }
>> > + }
>> > +
>> > + if (length <= (size_t) (lowPrefix - match)) {
>> > + /* match fits entirely within external dictionary : just copy */
>> > + memmove(op, dictEnd - (lowPrefix - match), length);
>> > + op += length;
>> > + } else {
>> > + /* match stretches into both external dict and current block */
>> > + size_t const copySize =
>> > + (size_t) (lowPrefix - match);
>> > + size_t const restSize = length - copySize;
>> > + memcpy(op, dictEnd - copySize, copySize);
>> > + op += copySize;
>> > + if (restSize > (size_t) (op - lowPrefix)) { /* overlap copy */
>> > + BYTE *const endOfMatch = op + restSize;
>> > + const BYTE *copyFrom = lowPrefix;
>> > + while (op < endOfMatch) {
>> > + *op++ = *copyFrom++;
>> > + }
>> > + } else {
>> > + memcpy(op, lowPrefix, restSize);
>> > + op += restSize;
>> > + }
>> > + }
>> > + continue;
>> > + }
>> > +
>> > + /* copy match within block */
>> > + cpy = op + length;
>> > +
>> > + assert((op <= oend) && (oend - op >= 32));
>> > + if (unlikely(offset < 16)) {
>> > + LZ4_memcpy_using_offset(op, match, cpy, offset);
>> > + } else {
>> > + LZ4_wildCopy32(op, match, cpy);
>> > + }
>> > +
>> > + op = cpy; /* wildcopy correction */
>> > + }
>> > + safe_decode:
>> > +#endif
>> > + /* Main Loop : decode sequences */
>> > + while (1) {
>> > + token = *ip++;
>> > + length = token >> ML_BITS;
>> >
>> > /* ip < iend before the increment */
>> > assert(!endOnInput || ip <= iend);
>> > @@ -143,26 +450,27 @@ static FORCE_INLINE int LZ4_decompress_generic(
>> > * combined check for both stages).
>> > */
>> > if ((endOnInput ? length != RUN_MASK : length <= 8)
>> > - /*
>> > - * strictly "less than" on input, to re-enter
>> > - * the loop with at least one byte
>> > - */
>> > - && likely((endOnInput ? ip < shortiend : 1) &
>> > - (op <= shortoend))) {
>> > + /*
>> > + * strictly "less than" on input, to re-enter
>> > + * the loop with at least one byte
>> > + */
>> > + && likely((endOnInput ? ip < shortiend : 1) &
>> > + (op <= shortoend))) {
>> > /* Copy the literals */
>> > memcpy(op, ip, endOnInput ? 16 : 8);
>> > - op += length; ip += length;
>> > + op += length;
>> > + ip += length;
>> >
>> > /*
>> > * The second stage:
>> > * prepare for match copying, decode full info.
>> > * If it doesn't work out, the info won't be wasted.
>> > */
>> > - length = token & ML_MASK; /* match length */
>> > + length = token & ML_MASK; /* match length */
>> > offset = LZ4_readLE16(ip);
>> > ip += 2;
>> > match = op - offset;
>> > - assert(match <= op); /* check overflow */
>> > + assert(match <= op); /* check overflow */
>> >
>> > /* Do not deal with overlapping matches. */
>> > if ((length != ML_MASK) &&
>> > @@ -187,28 +495,24 @@ static FORCE_INLINE int LZ4_decompress_generic(
>> >
>> > /* decode literal length */
>> > if (length == RUN_MASK) {
>> > - unsigned int s;
>> >
>> > - if (unlikely(endOnInput ? ip >= iend - RUN_MASK : 0)) {
>> > - /* overflow detection */
>> > + variable_length_error error = ok;
>> > + length +=
>> > + read_variable_length(&ip, iend - RUN_MASK,
>> > + endOnInput, endOnInput,
>> > + &error);
>> > + if (error == initial_error)
>> > goto _output_error;
>> > - }
>> > - do {
>> > - s = *ip++;
>> > - length += s;
>> > - } while (likely(endOnInput
>> > - ? ip < iend - RUN_MASK
>> > - : 1) & (s == 255));
>> >
>> > if ((safeDecode)
>> > - && unlikely((uptrval)(op) +
>> > - length < (uptrval)(op))) {
>> > + && unlikely((uptrval) (op) +
>> > + length < (uptrval) (op))) {
>> > /* overflow detection */
>> > goto _output_error;
>> > }
>> > if ((safeDecode)
>> > - && unlikely((uptrval)(ip) +
>> > - length < (uptrval)(ip))) {
>> > + && unlikely((uptrval) (ip) +
>> > + length < (uptrval) (ip))) {
>> > /* overflow detection */
>> > goto _output_error;
>> > }
>> > @@ -216,11 +520,15 @@ static FORCE_INLINE int LZ4_decompress_generic(
>> >
>> > /* copy literals */
>> > cpy = op + length;
>> > +#if LZ4_FAST_DEC_LOOP
>> > + safe_literal_copy:
>> > +#endif
>> > LZ4_STATIC_ASSERT(MFLIMIT >= WILDCOPYLENGTH);
>> >
>> > if (((endOnInput) && ((cpy > oend - MFLIMIT)
>> > - || (ip + length > iend - (2 + 1 + LASTLITERALS))))
>> > - || ((!endOnInput) && (cpy > oend - WILDCOPYLENGTH))) {
>> > + || (ip + length >
>> > + iend - (2 + 1 + LASTLITERALS))))
>> > + || ((!endOnInput) && (cpy > oend - WILDCOPYLENGTH))) {
>> > if (partialDecoding) {
>> > if (cpy > oend) {
>> > /*
>> > @@ -231,7 +539,7 @@ static FORCE_INLINE int LZ4_decompress_generic(
>> > length = oend - op;
>> > }
>> > if ((endOnInput)
>> > - && (ip + length > iend)) {
>> > + && (ip + length > iend)) {
>> > /*
>> > * Error :
>> > * read attempt beyond
>> > @@ -241,7 +549,7 @@ static FORCE_INLINE int LZ4_decompress_generic(
>> > }
>> > } else {
>> > if ((!endOnInput)
>> > - && (cpy != oend)) {
>> > + && (cpy != oend)) {
>> > /*
>> > * Error :
>> > * block decoding must
>> > @@ -250,7 +558,7 @@ static FORCE_INLINE int LZ4_decompress_generic(
>> > goto _output_error;
>> > }
>> > if ((endOnInput)
>> > - && ((ip + length != iend)
>> > + && ((ip + length != iend)
>> > || (cpy > oend))) {
>> > /*
>> > * Error :
>> > @@ -288,29 +596,14 @@ static FORCE_INLINE int LZ4_decompress_generic(
>> > goto _output_error;
>> > }
>> >
>> > - /* costs ~1%; silence an msan warning when offset == 0 */
>> > - /*
>> > - * note : when partialDecoding, there is no guarantee that
>> > - * at least 4 bytes remain available in output buffer
>> > - */
>> > - if (!partialDecoding) {
>> > - assert(oend > op);
>> > - assert(oend - op >= 4);
>> > -
>> > - LZ4_write32(op, (U32)offset);
>> > - }
>> > -
>> > if (length == ML_MASK) {
>> > - unsigned int s;
>> > -
>> > - do {
>> > - s = *ip++;
>> > -
>> > - if ((endOnInput) && (ip > iend - LASTLITERALS))
>> > - goto _output_error;
>> >
>> > - length += s;
>> > - } while (s == 255);
>> > + variable_length_error error = ok;
>> > + length +=
>> > + read_variable_length(&ip, iend - LASTLITERALS + 1,
>> > + endOnInput, 0, &error);
>> > + if (error != ok)
>> > + goto _output_error;
>> >
>> > if ((safeDecode)
>> > && unlikely(
>> > @@ -322,6 +615,10 @@ static FORCE_INLINE int LZ4_decompress_generic(
>> >
>> > length += MINMATCH;
>> >
>> > +#if LZ4_FAST_DEC_LOOP
>> > +safe_match_copy:
>> > +#endif
>> > +
>> > /* match starting within external dictionary */
>> > if ((dict == usingExtDict) && (match < lowPrefix)) {
>> > if (unlikely(op + length > oend - LASTLITERALS)) {
>> >