Re: [PATCH 15/39] mtd: nand: denali: improve readability of handle_ecc()

From: Boris Brezillon
Date: Fri Dec 02 2016 - 02:55:59 EST


On Fri, 2 Dec 2016 13:26:27 +0900
Masahiro Yamada <yamada.masahiro@xxxxxxxxxxxxx> wrote:

> Hi Boris,
>
>
> 2016-11-28 0:42 GMT+09:00 Boris Brezillon <boris.brezillon@xxxxxxxxxxxxxxxxxx>:
> >> +	if (err_byte < ECC_SECTOR_SIZE) {
> >> +		struct mtd_info *mtd =
> >> +			nand_to_mtd(&denali->nand);
> >> +		int offset;
> >> +
> >> +		offset = (err_sector * ECC_SECTOR_SIZE + err_byte) *
> >> +				denali->devnum + err_device;
> >> +		/* correct the ECC error */
> >> +		buf[offset] ^= err_correction_value;
> >> +		mtd->ecc_stats.corrected++;
> >> +		bitflips++;
> >
> > Hm, bitflips is what is set in max_bitflips, and apparently the
> > implementation (which is not yours) is not doing what the core expects.
> >
> > You should first count bitflips per sector with something like this:
> >
> > bitflips[err_sector]++;
> >
> >
> > And then once you've iterated over all errors do:
> >
> > for (i = 0; i < nsectors; i++)
> > 	max_bitflips = max(bitflips[i], max_bitflips);
>
>
> I see.
>
> For soft ECC fixup, we can calculate bitflips
> for each ECC sector, so I can fix the max_bitflips
> as the core framework expects.
>
> For hard ECC fixup, the register only reports
> the number of corrected bit-flips
> in the whole page (sum from all ECC sectors).
> We cannot calculate max_bitflips, I think.
>

That's unfortunate. This means you'll return -EUCLEAN more quickly
(which will trigger UBI eraseblock move), since the NAND framework is
basing its 'too many bitflips' detection logic on the max_bitflips per
ECC chunk and the bitflips threshold (by default 3/4 of the ECC
strength).

That doesn't mean it won't work, you'll just wear your NAND more
quickly :-(.

OTOH, doing max_bitflips = nbitflips / nsteps is not good either,
because the bitflips might be all concentrated in the same ECC chunk,
and in this case you really want to return -EUCLEAN.
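
To make the trade-off concrete, here is a rough userspace sketch (the
helper names are made up for illustration and are not driver code, and
the rounding of the 3/4 threshold is an assumption). It compares the
values one could report: the per-chunk maximum, the page-wide sum, and,
by dividing that sum by the number of steps, the average.

```c
#include <stddef.h>

/* Hypothetical helpers for illustration; not taken from the denali driver. */

/* Largest per-chunk count: what the core expects as max_bitflips. */
static unsigned int max_per_chunk(const unsigned int *flips, size_t nsteps)
{
	unsigned int max = 0;
	size_t i;

	for (i = 0; i < nsteps; i++)
		if (flips[i] > max)
			max = flips[i];
	return max;
}

/* Page-wide sum: all that is known in the hard ECC fixup case. */
static unsigned int page_total(const unsigned int *flips, size_t nsteps)
{
	unsigned int sum = 0;
	size_t i;

	for (i = 0; i < nsteps; i++)
		sum += flips[i];
	return sum;
}

/* Default threshold: 3/4 of the ECC strength (rounding up is assumed). */
static unsigned int default_threshold(unsigned int strength)
{
	return (strength * 3 + 3) / 4;
}

/* Non-zero when the reported value would lead to -EUCLEAN. */
static int too_many_bitflips(unsigned int reported, unsigned int strength)
{
	return reported >= default_threshold(strength);
}
```

With an ECC strength of 8 the threshold is 6. For per-chunk counts
{2, 2, 2, 2}, the maximum (2) stays below it but the sum (8) crosses it,
so the block gets moved early. For {7, 0, 0, 0}, the maximum (7) crosses
it as it should, while the average (7 / 4 = 1) hides the problem.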

>
>
> BTW, I noticed another problem of the current code.
>
> buf[offset] ^= err_correction_value;
> mtd->ecc_stats.corrected++;
> bitflips++;
>
> This code is counting the number of corrected bytes,
> not the number of corrected bits.
>
>
> I think multiple bit-flips within one byte can happen.

Yes.

>
>
> Perhaps, we should add
>
> hweight8(buf[offset] ^ err_correction_value)
>
> to ecc_stats.corrected and bitflips.
>

Looks good.
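
As a rough illustration of the bytes-versus-bits difference, here is a
small userspace sketch. It assumes the correction value is an XOR mask
(as the `^=` in the quoted code suggests) and uses a made-up popcount8()
as a stand-in for the kernel's hweight8(); the helper names are
hypothetical.

```c
#include <stdint.h>

/* Portable stand-in for the kernel's hweight8(): number of set bits. */
static unsigned int popcount8(uint8_t v)
{
	unsigned int n = 0;

	while (v) {
		v &= v - 1;	/* clear the lowest set bit */
		n++;
	}
	return n;
}

/*
 * Apply an XOR correction mask (assumed semantics) to one byte and
 * return how many bits differ between the old and the new value.
 */
static unsigned int correct_byte(uint8_t *byte, uint8_t xor_mask)
{
	uint8_t before = *byte;

	*byte ^= xor_mask;
	return popcount8(before ^ *byte);
}
```

For a byte 0xF0 and a mask 0x05, a single byte is corrected (to 0xF5)
but two bits changed, so a per-bit counter would advance by 2 where the
current per-byte counting advances by 1.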