Re: Massive e2fs corruption with 2.2.9/10?

Harald Koenig (koenig@tat.physik.uni-tuebingen.de)
Thu, 17 Jun 1999 14:30:51 +0200


On Jun 16, Philip Gladstone wrote:

>
>
> roel@grobbebol.xs4all.nl wrote:
> > I have had an ext2 crash that also took parts of a mounted DOS filesystem
> > with 2.2.9 but so far not many people have had this. It's in my case an IDE
> > disk that dies; SCSI kept working fine.
>
> We are getting strange disk corruptions on 2.2.9 -- well actually I think
> that they are buffer cache corruptions. The odd bit gets flipped in files
> -- this is not noticeable until some program segfaults that used to work.
> Running rpm --verify reveals that it has suffered corruption. If I restore
> the files with rpm --upgrade --force and then immediately do an rpm --verify,
> then sometimes some of the files are corrupt. I doubt that it is a disk
> problem as the system has a lot of memory and is pretty idle.

Interesting to read this. I have very similar problems right now:

Twice I got strange ext2fs errors, and both times it was a duplicate block.
For the second error I kept the fsck log:

Duplicate blocks found... invoking duplicate block passes.
Pass 1B: Rescan for duplicate/bad blocks
Duplicate/bad block(s) in inode 196693: 787271
Duplicate/bad block(s) in inode 196854: 787271
Duplicate/bad block(s) in inode 418198: 787271
Pass 1C: Scan directories for inodes with dup blocks.
Pass 1D: Reconciling duplicate blocks
(There are 3 inodes containing duplicate/bad blocks.)
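
To make it explicit that all three inodes claim the same block, the Pass 1B lines can be grouped by block number. This is only a minimal sketch: the function name `inodes_by_block` and the regular expression are assumptions based on the three log lines above, not on any documented e2fsck output format.

```python
import re
from collections import defaultdict

# Sample lines copied from the fsck log above.
FSCK_LOG = """\
Duplicate/bad block(s) in inode 196693: 787271
Duplicate/bad block(s) in inode 196854: 787271
Duplicate/bad block(s) in inode 418198: 787271
"""

# Assumed line format (inferred from the log above, not from e2fsck docs):
#   Duplicate/bad block(s) in inode <inode>: <block> [<block> ...]
LINE_RE = re.compile(r"^Duplicate/bad block\(s\) in inode (\d+):((?: \d+)+)\s*$")


def inodes_by_block(log_text):
    """Map each reported block number to the sorted list of inodes claiming it."""
    owners = defaultdict(set)
    for line in log_text.splitlines():
        match = LINE_RE.match(line)
        if match is None:
            continue  # skip pass headers and any other fsck output
        inode = int(match.group(1))
        for block in match.group(2).split():
            owners[int(block)].add(inode)
    return {block: sorted(inodes) for block, inodes in owners.items()}


if __name__ == "__main__":
    for block, inodes in inodes_by_block(FSCK_LOG).items():
        print(f"block {block} is claimed by inodes {inodes}")
```

Any block that maps to more than one inode is shared; here that is the single block 787271.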

More recently, two programs crashed at runtime (first mutt, later emacs).
In both cases `rpm -V package' showed `..5.....' (MD5 checksum mismatch), and
trashing the buffer cache `fixed' the problem.
For emacs, I kept a copy of the bad image/buffer before `fixing' it, and here I get

# cmp -l /usr/bin/emacs.good /usr/bin/emacs.bad
300137 355 255

(so here, too, only one bit is flipped). I'm not sure whether it's 2.2.9/2.2.10
or a hardware problem, because it started just when I changed hardware
(CPU, mainboard, memory) to an AMD K6-2-450 with 128MB (DFI mainboard).
But now, reading similar reports, maybe it's not my hardware?!
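
The single-bit claim can be checked from the cmp -l line alone: cmp -l prints the byte number in decimal and the two differing byte values in octal. The following is a small sketch, with `decode_cmp_line` being a hypothetical helper name, that XORs the two values and lists the bit positions that differ.

```python
def decode_cmp_line(line):
    """Parse one `cmp -l` line: decimal byte number, then two octal byte values."""
    offset_text, first_text, second_text = line.split()
    offset = int(offset_text)      # 1-based byte number, decimal
    first = int(first_text, 8)     # byte value in the first file, octal
    second = int(second_text, 8)   # byte value in the second file, octal
    diff = first ^ second          # set bits mark positions where the bytes differ
    flipped = [bit for bit in range(8) if diff & (1 << bit)]
    return offset, first, second, flipped


if __name__ == "__main__":
    offset, good, bad, flipped = decode_cmp_line("300137 355 255")
    print(f"byte {offset}: good=0x{good:02x} bad=0x{bad:02x} flipped bits={flipped}")
```

For the line above, octal 355 is 0xed and octal 255 is 0xad, so the XOR is 0x40: exactly one bit (bit 6, counting from 0) differs, and it is cleared in the bad copy.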

Did you change your hardware [settings?] recently?

> Initially I suspected the memory (384Megs of PC100 SDRAM), but we
> have been running memtest86 for several hours with no errors. We'll
> leave it running overnight to see what happens.
>
> Any other ideas on tests? I guess we could go back to 2.0.36 and see
> if the problem goes away......
>
> Philip
> --
> Philip Gladstone +1 781 530 2461
> Axent Technologies, Waltham, MA
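
One further test idea, sketched under assumptions: write a known pattern to a scratch file, read it back several times, and report every byte that differs. All names here (`make_pattern`, `diff_bytes`, `write_and_verify`) are hypothetical, and the rereads will most likely be served from the buffer cache, so this would exercise memory and cache rather than the disk surface.

```python
import hashlib
import os
import tempfile


def make_pattern(size, seed=b"pattern-seed"):
    """Build `size` bytes of reproducible pseudo-random data via chained SHA-256."""
    chunks = []
    block = seed
    produced = 0
    while produced < size:
        block = hashlib.sha256(block).digest()
        chunks.append(block)
        produced += len(block)
    return b"".join(chunks)[:size]


def diff_bytes(expected, actual):
    """Return (offset, expected_byte, actual_byte) for each differing position.

    Only the common prefix is compared; a length mismatch is not reported.
    """
    return [(i, e, a) for i, (e, a) in enumerate(zip(expected, actual)) if e != a]


def write_and_verify(directory, size=1 << 20, rounds=3):
    """Write a known pattern to a scratch file and reread it `rounds` times."""
    expected = make_pattern(size)
    mismatches = []
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as handle:
            handle.write(expected)
            handle.flush()
            os.fsync(handle.fileno())  # ask the kernel to push the data to disk
        for _ in range(rounds):
            with open(path, "rb") as handle:
                mismatches.extend(diff_bytes(expected, handle.read()))
    finally:
        os.remove(path)
    return mismatches


if __name__ == "__main__":
    bad = write_and_verify(tempfile.gettempdir())
    print("no mismatches" if not bad else f"{len(bad)} mismatching bytes: {bad[:10]}")
```

An empty result proves nothing by itself, but a non-empty one gives offsets and byte values that can be compared with the bit pattern seen in the corrupted files.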

Harald

--
All SCSI disks will from now on                     ___       _____
be required to send an email notice                0--,|    /OOOOOOO\
24 hours prior to complete hardware failure!      <_/  /  /OOOOOOOOOOO\
                                                    \  \/OOOOOOOOOOOOOOO\
                                                      \ OOOOOOOOOOOOOOOOO|//
Harald Koenig,                                         \/\/\/\/\/\/\/\/\/
Inst.f.Theoret.Astrophysik                              //  /     \\  \
koenig@tat.physik.uni-tuebingen.de                     ^^^^^       ^^^^^
