Re: Erroneous data with ext2fs

Richard B. Johnson (root@analogic.com)
Sun, 9 Mar 1997 18:47:37 -0500 (EST)


On Sun, 9 Mar 1997, Ingo Molnar wrote:

>
> > 525807650 41 0
> > 525807651 10 0
> >
> > As you may notice, from 525807618 on, it looks like a 4-byte integer in
> > lsb order which counts up.
>
> the 1024 bytes boundary is at 525807616. This means the following integer
> is counted up:
>
> 0 2 200 206
> 0 2 200 207
> 0 2 200 208
> [...]
>
> additionally, your analysis shows that errors only occur after ~480M data
> written, and >never< before. At first sight this rules out ext2fs as an
> error source (but see below). Your device is 630M large.
>
> as the corrupted data has the above regularity, block buffer corruption
> can be ruled out too.
>
> Back to the ext2fs issue, AFAIK there is only one physical ext2fs metadata
> that has the above structure, namely 'inode data double indirection
> blocks'. Assuming that the corrupted data is a double indirection block
> and converting the above number back into block number leads to block
> #182478 [plausible]. The erroneous block is ~ 525807616/1024=513484.
>
> Such large distance between a block and a (possibly related) inode
> metadata block is quite strange. [default ext2fs behaviour for your
> sequentially created zero file is allocating blocks in a continuous manner,
> and adding one indirection block every ... 256K ]. Thus for a 'correct
> file', the indirection block containing 182478,182479,182480,... should be
> near block 182478 +- 256K, but not block ~513484.
>
> I don't understand how this metadata block could get up there. Maybe people
> with more ext2fs knowledge know the answer.
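
The arithmetic in the quoted analysis can be rechecked with a minimal
sketch like the one below. It assumes 1024-byte blocks and that the four
byte values listed above are decimal and stored least-significant byte
first on disk; the function names are made up for illustration.

```python
BLOCK_SIZE = 1024  # assumed ext2 block size, per the "1024 bytes boundary" above

def le_bytes_to_block(b0, b1, b2, b3):
    """Combine four bytes, least-significant first, into a block number."""
    return int.from_bytes(bytes([b0, b1, b2, b3]), "little")

def offset_to_block(offset, block_size=BLOCK_SIZE):
    """Return (block number, offset within block) for a 0-based byte offset."""
    return divmod(offset, block_size)

# "0 2 200 206" is listed most-significant byte first; on disk it is 206 200 2 0.
print(le_bytes_to_block(206, 200, 2, 0))   # 182478
print(offset_to_block(525807616))          # (513484, 0)
print(BLOCK_SIZE // 4)                     # 256 four-byte block pointers per block
```

With 256 pointers per 1024-byte block, one indirection block covers 256K
of file data, which matches the "every ... 256K" figure above.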

Let me add some research. I have a spare SCSI disk. Rather than erasing
it, which takes some time, I did....

# mke2fs /dev/sdc1

I had to do this several times because I kept getting a "Can't get a free
page" error from the kernel and the process would hang. I got sick of
trying to get a free page so I did....

# kill -TERM -1
# ifconfig eth0 down
# ifconfig lo down
# kill -KILL -1
# sync
# umount -a
# mount -n -o remount /

Okay, so that got me a minimum system. Now I had a free page (or two).

Eventually I made a new file-system on the spare partition of the spare
disk. I mounted it off /mnt and...

# cp /dev/zero /mnt/ZERO

I let this run until I had the partition 99 percent full.

Then I did:

# cmp -l /dev/zero /mnt/ZERO &>RESULTS &

# head RESULTS

731162625 0 17
731162626 0 17
731162627 0 17
731162628 0 17
731162629 0 17
731162630 0 17
731162631 0 17
731162632 0 17
731162633 0 17
731162634 0 17

This made a very large file (over 4 megabytes) of differences! The
differences started at the byte-offset shown above.
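
To relate that offset to a file-system block, a rough sketch follows. It
assumes 1024-byte blocks and the usual `cmp -l` output format (a 1-based
decimal byte number followed by the two differing byte values in octal);
the helper names are illustrative only.

```python
BLOCK_SIZE = 1024  # assumed block size of the file-system

def parse_cmp_line(line):
    """Parse one `cmp -l` line: 1-based decimal offset, two octal byte values."""
    offset, first, second = line.split()
    return int(offset), int(first, 8), int(second, 8)

def locate(offset_1based, block_size=BLOCK_SIZE):
    """Return (block number, offset within block) for a 1-based byte number."""
    return divmod(offset_1based - 1, block_size)

offset, zero_byte, file_byte = parse_cmp_line("731162625 0 17")
print(offset, zero_byte, file_byte)  # 731162625 0 15
print(locate(offset))                # (714026, 0)
```

Under those assumptions the first difference sits on the first byte of
block 714026 of the file, that is, exactly on a block boundary.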

# tail RESULTS

828074999 0 12
828075000 0 11
828075001 0 10
828075002 0 7
828075003 0 6
828075004 0 5
828075005 0 4
828075006 0 3
828075007 0 2
828075008 0 1
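
The tail can be decoded the same way. This is only a sketch, assuming
the third column is octal (as `cmp -l` normally prints it) and that
blocks are 1024 bytes; the variable names are illustrative.

```python
TAIL = """\
828074999 0 12
828075000 0 11
828075001 0 10
828075002 0 7
828075003 0 6
828075004 0 5
828075005 0 4
828075006 0 3
828075007 0 2
828075008 0 1"""

rows = [line.split() for line in TAIL.splitlines()]
offsets = [int(r[0]) for r in rows]        # 1-based byte numbers
values = [int(r[2], 8) for r in rows]      # octal byte values from the file

print(values)                              # [10, 9, 8, 7, 6, 5, 4, 3, 2, 1]
steps = {a - b for a, b in zip(values, values[1:])}
print(steps)                               # {1}
print(offsets[-1] % 1024)                  # 0
```

Read as octal, the last ten bytes count down by one per byte, and the
final differing byte is the last byte of a 1024-byte block.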

So there DOES seem to be something wrong with the e2fs presently.

Now, I wondered if it was the SCSI driver, rather than the file-system.
Therefore, I proceeded as follows:

# umount /mnt
# cp /dev/zero /dev/sdc1 # Copy to the raw partition.

The results were * S P E C T A C U L A R * (don't try this at home)!

There was a resounding crash with the screen attributes being written
so my terminal was lit up like a Christmas Tree. The "bell" came on
and stayed on. All the LEDs on my external modem came to life with
continuous data being sent, plus the Num-Lock on my keyboard started
flashing at about 1-second intervals. It was the most spectacular
crash I had ever seen.

Strangely the system was still alive (sort of). It responded to
Ctrl-Alt-Del, ran shutdown and rebooted.

It started normally. The first 1456 (0 to 1455) blocks on /dev/sdc1 DID get
written. However, everything after that is part of the old file-system.
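
How that block count was obtained is not stated here. One way to get
such a number, as a minimal sketch assuming 1024-byte blocks, is to read
from the start and count blocks until the first one that is not all
zeros. The function name is hypothetical, and the demo runs on a scratch
file rather than a real device.

```python
import os
import tempfile

def count_leading_zero_blocks(path, block_size=1024, max_blocks=None):
    """Count consecutive all-zero blocks at the start of a file or device."""
    zero = bytes(block_size)
    count = 0
    with open(path, "rb") as f:
        while max_blocks is None or count < max_blocks:
            block = f.read(block_size)
            # Stop at a short read (end of data) or the first non-zero block.
            if len(block) < block_size or block != zero:
                break
            count += 1
    return count

# Demo on a scratch file: 3 zeroed blocks followed by one block of 0xFF.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(bytes(3 * 1024) + b"\xff" * 1024)
    scratch = tmp.name
print(count_leading_zero_blocks(scratch))  # 3
os.remove(scratch)
```

Pointing it at a raw device would need read permission on that device,
and `max_blocks` keeps the scan bounded on a large disk.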

Undaunted, I decided to copy directly to the raw device, rather than
the partition (screw the partition table). I went back to a minimum
system as before then....

# cp /dev/zero /dev/sdc

This resulted in another crash, but it wasn't very interesting. I use
the BusLogic controller on this machine.

Attached devices:
Host: scsi0 Channel: 00 Id: 00 Lun: 00
  Vendor: Quantum  Model: XP32150W          Rev: L912
  Type:   Direct-Access                     ANSI SCSI revision: 02
Host: scsi0 Channel: 00 Id: 01 Lun: 00
  Vendor: Quantum  Model: XP32150W          Rev: L912
  Type:   Direct-Access                     ANSI SCSI revision: 02
Host: scsi0 Channel: 00 Id: 04 Lun: 00
  Vendor: TOSHIBA  Model: CD-ROM XM-3601TA  Rev: 1885
  Type:   CD-ROM                            ANSI SCSI revision: 02
Host: scsi0 Channel: 00 Id: 05 Lun: 00
  Vendor: CONNER   Model: CTT8000-S         Rev: 1.17
  Type:   Sequential-Access                 ANSI SCSI revision: 02
Host: scsi0 Channel: 00 Id: 06 Lun: 00
  Vendor: QUANTUM  Model: FIREBALL_TM1280S  Rev: 300N
  Type:   Direct-Access                     ANSI SCSI revision: 02

The device I tried to write to is the QUANTUM "FIREBALL".

Enough research for today.

Cheers,
Dick Johnson
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
Richard B. Johnson
Project Engineer
Analogic Corporation
Voice : (508) 977-3000 ext. 3754
Fax : (508) 532-6097
Modem : (508) 977-6870
Ftp : ftp@boneserver.analogic.com
Email : rjohnson@analogic.com, johnson@analogic.com
Penguin : Linux version 2.1.28 on an i586 machine (66.15 BogoMips).
Warning : It's hard to remain at the trailing edge of technology.
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-