Bug in 2.4 kernel?: UFS file read problem

From: Clay Claiborne (cjc@cosmoseng.com)
Date: Fri Jun 09 2000 - 01:23:20 EST


Perhaps you have seen my earlier postings on this problem. I have been
tasked with mounting DEC OSF partitions r/w from a Linux box. The
problem that I ran into was the inability to read files past the direct
blocks (96K), and scrambled data on writes.

The good news is that I have solved my immediate problem.
The bad news is that I may have uncovered a bug in the 2.4.0 test kernel.

UFS support is available in the 2.2 kernel but I have been working with
the 2.3.99preX, and more recently the 2.4.0-test1 kernel because they
had the necessary support for the OSF partition type. I am neither a C
programmer nor a kernel hacker, so this has hindered me in figuring out
what is going on. At first I assumed that there was something weird
about the DEC OSF file system that the Linux implementation of ufs
wasn't getting. But there is nothing weird about the way DEC is doing
things. It's all very straightforward inode organization. The more I
looked into it, the more it seemed like the ufs code was just broken.
Now I am convinced it is.

I think I've got the problem narrowed down to the bh->b_data pointer in
fs/ufs/inode.c moving around when it shouldn't. I think there is a
"failure to communicate" between the ufs fs presentation of the inode
and the VFS reading of it and something is fouling the reading of the
1st indirect block. Then I see that fs/inode.c has been extensively
reworked in the 2.3.X series, so I wonder ...

I patched the 2.2.15 tree with support for the OSF partition (just
dropped the code in from 2.4.0-test1) and built a kernel. Voila, my
problem disappeared! I can read a 97MB file right off the DEC OSF.

Conclusion: Something is broken that was working, and it may affect more
than the ufs filesystem. I will investigate further but some other
people might want to look at this.

On another subject I've also experienced 'unresolved symbol' problems
with the ne2k-pci module in 2.3.99preX and 2.4.0-test1 kernels.

I have copied extensive material from my earlier correspondence on
the ufs problem below. I hope it can be useful in helping to sort this
out.

--

Clay J. Claiborne, Jr., President

Cosmos Engineering Company 1550 South Dunsmuir Ave. Los Angeles, CA 90019

(323) 930-2540 (323) 930-1393 Fax

http://www.CosmosEng.com

Email: cjc@cosmoseng.com

====================================================

General Dynamics has given me a contract to build a Linux server that can mount a Digital Unix SCSI II drive and make the files available to Windows workstations for reading and deleting via samba. They have provided me with three sample drives, all 9.1 GB Quantum Atlas II’s with 50 pin SCSI interface.

This is a commercial contract, so there is money for you if you can help me solve this. Beyond that, I think it would be good for Linux. General Dynamics is looking for Linux solutions because the government has mandated it. I want to show them that Linux can deliver.

That being said, on to the technical details and problems.

My workstation is a basic Pentium III with 128MB RAM and an 18GB LVD system drive. It is running RH6.2 and for this job I am working with the 2.3.99pre8 kernel.

The sample drives appear to have the OSF partition type and the UFS file system (DEC OSF/1?). With CONFIG_UFS_FS, CONFIG_UFS_FS_WRITE and CONFIG_OSF_PARTITION set in the kernel, I can mount the disk with a command like:

mount -t ufs -o ufstype=sun /dev/sdb3 /du1

Then I can read the directory and inode table (ls -l) and can read short files okay, but long files get truncated at 98304 bytes (192 * 512 = 98304). This has been the case on two different drives with two different files, which are the only samples I have that are longer than 98304 bytes.

The command:

cp -v /du2/oilstock.tar .

returns the error:

cp: /du2/oilstock.tar: Input/output error

and in /var/log/messages:

May 30 05:43:46 GD-DU kernel: attempt to access beyond end of device
May 30 05:43:46 GD-DU kernel: 08:23: rw=0, want=536934401, limit=8890760
May 30 05:43:46 GD-DU kernel: attempt to access beyond end of device
May 30 05:43:46 GD-DU kernel: 08:23: rw=0, want=536934402, limit=8890760

..

Those ‘want’ numbers look way out of line to me. Is something getting screwed up in the way the block numbers are being read?

I can remount the drive rw. This doesn’t change the above read problem. However, I can write to the ufs drive, and write and read back long files, like my sample linux-2.3.99pre8.tar.gz (20MB).

So I have a problem reading long files that are native (but not long files written by Linux).

Looking at ufs_fs.h it appears UFS_BLOCK_SIZE = 8192. What does UFS_NDADDR = 12 mean? Does it mean 12 block addresses are stored directly in the inode? If so, 12 * 8192 = 98304, which gives the rather tidy answer that the system is getting the direct addressing right but misreading the indirect addresses written on the DEC system, while reading its own writing of those addresses fine.
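The arithmetic behind that guess can be checked with a minimal Python sketch. It assumes only the two constants quoted from ufs_fs.h above (UFS_NDADDR = 12 and an 8192-byte block); the function name is made up for illustration and is not part of the kernel source.

```python
# Constants as quoted from ufs_fs.h in the text above.
UFS_NDADDR = 12        # direct block addresses held in the inode
UFS_BLOCK_SIZE = 8192  # bytes per filesystem block


def direct_limit_bytes(ndaddr=UFS_NDADDR, block_size=UFS_BLOCK_SIZE):
    """Bytes reachable through the direct block addresses alone."""
    return ndaddr * block_size


limit = direct_limit_bytes()
print(limit, "bytes =", limit // 1024, "KB =", limit // 512, "sectors of 512 bytes")
```

If the constants mean what the question supposes, the direct blocks cover exactly the 98304 bytes (192 sectors of 512 bytes) at which the reads are cut off.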

Could this problem be related to the fact that the fs and files were written on a 64 bit system (Alpha, I assume), whereas I am working on a 32 bit system (i386)? If that is the case, how do I fix it?

Any light you can throw on this problem will be much appreciated.

--

I added the following observations to my earlier letter and posted it to linux-kernel: Does this sound like I'm on the right track?

Looking at ufs_fs.h it appears UFS_BLOCK_SIZE = 8192. What does UFS_NDADDR = 12 mean? Does it mean 12 block addresses are stored directly in the inode? If so, 12 * 8192 = 98304, which gives the rather tidy answer that the system is getting the direct addressing right but misreading the indirect addresses written on the DEC system, while reading its own writing of those addresses fine.

Could this problem be related to the fact that the fs and files were written on a 64 bit system (Alpha, I assume), whereas I am working on a 32 bit system (i386)? If that is the case, how do I fix it?

Clay

-----

Thanks again. I did a grep of the source tree for UFS_NDADDR and turned up inode.c. 12 block addresses are listed in the inode, with 3 indirect addresses. So I figure I'm on the right track too. The indirect addressing is being mishandled somehow, like "byte-swapped filesystems" or maybe a signed/unsigned integer problem?

Next clue to investigate:

May 30 05:43:46 GD-DU kernel: attempt to access beyond end of device
May 30 05:43:46 GD-DU kernel: 08:23: rw=0, want=536934401, limit=8890760
May 30 05:43:46 GD-DU kernel: attempt to access beyond end of device
May 30 05:43:46 GD-DU kernel: 08:23: rw=0, want=536934402, limit=8890760

Those numbers, 536934401, 536934402, etc., obviously don't refer to any blocks on this disk, but where do they come from? I think that if I can understand how they were generated, I will know exactly what the problem is.

536934401 = 2000 F801 hex = 100000000000001111100000000001 binary

1308676867 = 4E00 D303 hex = 1001110000000001101001100000011 binary
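The hex and binary forms of those two values can be reproduced with a few lines of Python. This is only a conversion sketch; the helper name is invented for illustration, and the device limit is the 'limit' figure from the kernel messages quoted in this message.

```python
def describe(n):
    """Return (hex, binary) strings for a non-negative integer."""
    return format(n, "08X"), format(n, "b")


DEVICE_LIMIT = 8890760  # 'limit' reported in the kernel messages

for want in (536934401, 1308676867):
    hx, bits = describe(want)
    # Print the hex value as two 16-bit halves.
    print(want, "=", hx[:4], hx[4:], "hex =", bits, "binary")
    print("  beyond end of device:", want > DEVICE_LIMIT)
```

Seeing the values as two 16-bit halves makes it easier to compare them against raw od output when looking for where they might have been read from.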

Clay

---------------

More info on my problem:

Upon further study, my original declaration that I could write files to the ufs partition and read them back was in error. It was an artifact of some level of buffering in RAM, because such correct read back does not survive a reboot, i.e.:

cp linux-2.3.99pre8.tar.gz /du2
cp -v /du2/linux-2.3.99-pre8.tar.gz temp1.tgz
tar -tzvf temp1.tgz - okay

reboot

cp -v /du2/linux-2.3.99-pre8.tar.gz temp2.tgz
tar -tzvf temp2.tgz - NOT OKAY
tar -tzvf temp1.tgz - okay

In fact it appears that a file written to the ufs and then read back will have the first 2KB truncated - which is to say that the first direct addr in the inode points to a block of data that was 2KB down in the original file, and only the first 1KB of each 8KB block contains valid data, the other 7KB is nulls.

How is a proper read back possible before a reboot? This is a 20MB file. Is it possible for a file that large to remain in cache on a 128MB machine? Or is it just a proper inode table that is being cached?

I've developed a technique for isolating a specific inode on the raw partition, and this has helped considerably. I realized that since I could write to the partition, that I could make changes and look for the change.

I know that the uid is 4th from the start of the inode, so I change the file's owner and look for the change. The commands look like this:

od -w4 -Ax -x -j 0 -N 10000 /dev/sdc3 >DU2-help-root
chown xfs /du2/Configure.help
sync
od -w4 -Ax -x -j 0 -N 10000 /dev/sdc3 >DU2-help-xfs
diff DU2-help-root DU2-help-xfs

From the output of the diff I determine the inode's offset and then:

od -w4 -Ax -x -j 33792 -N 128 /dev/sdc3|less

Here's my inode. Up until now I only knew inodes as elusive structures that lived on the disk somewhere. I knew vaguely that "the inode is the file" as SUN might say, but I never knew quite what they did or how they worked. Well all that is changing now. Now I'm printing inodes out, saving them to files (I'll note here that if you save all your inodes to files you won't have room for anything else) and coloring them in.
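The before-and-after comparison described above (dump a region, change the owner, dump again, diff) can also be sketched in Python. This is an illustration only: the two snapshots are synthetic byte strings rather than reads from /dev/sdc3, and the offset and uid value are made up.

```python
def changed_offsets(before, after):
    """Return the byte offsets at which two equal-length snapshots differ."""
    if len(before) != len(after):
        raise ValueError("snapshots must be the same length")
    return [i for i, (a, b) in enumerate(zip(before, after)) if a != b]


# Synthetic stand-ins for two dumps of the same region of the partition.
before = bytearray(256)
after = bytearray(before)
# Hypothetical: a chown rewrites a 2-byte uid field somewhere in an inode.
after[132:134] = (43).to_bytes(2, "little")

print(changed_offsets(bytes(before), bytes(after)))
```

With real dumps, the reported offsets would point at the uid field of the inode that was changed, and the start of that inode can be worked out from there.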

Anyway, from a study of its inode, this is what I've discovered about the big file, oilstock.tar, that won't copy more than 96K:

1) Fortunately the first 12 blocks are sequential. These are the direct address blocks. If the offset of the first block is 800 then

dd if=/dev/sdc3 of=crudeoil.tar bs=1k count=96 skip=800

will create a file that is identical to one produced by

cp /du2/oilstock.tar .

2) The 13th addr in the inode is the 1st indirect block. I can go to that block by using that address as a simple 32 bit offset of 1K blocks from the beginning of the dev. I can take the first 32 bit address I find there, use it as the offset of 1K blocks from the beginning of the dev, and pick up my data where it left off. All very simple, straightforward stuff. No byte swaps, no shifts, no new math. Fortunately oilstock.tar is also a ball of text files, so it's easy to see if the puzzle pieces are all there and in the right order.
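The manual walk in the two steps above can be sketched in Python. The sketch assumes that addresses count 1K units from the start of the device (as in the dd example) and that the indirect block holds 32-bit little-endian values, in line with the "no byte swaps" observation on i386. The function names are invented, and the sample block is built in memory from four addresses borrowed from the debug output further down, not read from a disk.

```python
import struct

FRAG_SIZE = 1024  # addresses are treated as counts of 1K units


def frag_to_byte_offset(frag):
    """Byte offset on the device for a 1K-unit address."""
    return frag * FRAG_SIZE


def parse_indirect_block(raw, count=None):
    """Decode consecutive 32-bit little-endian addresses from an indirect block."""
    n = len(raw) // 4 if count is None else count
    return list(struct.unpack("<%dI" % n, raw[: 4 * n]))


# Synthetic indirect block holding four addresses (an assumption for the demo).
raw = struct.pack("<4I", 24320, 24400, 24408, 24344)
addrs = parse_indirect_block(raw)
print(addrs)
print("first data block starts at byte", frag_to_byte_offset(addrs[0]))
```

On a real device, `raw` would be the bytes read at `frag_to_byte_offset()` of the 13th address in the inode, and each decoded address would then be fed back through the same offset calculation to fetch the data.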

So now I can read more of the unreadable file, at least with dd and od.

My problem is that I can't read the source code well enough to understand what Linux is doing and where it is getting it wrong.

This is where I need your help.

--------------------------------------

I modified fs/ufs/inode.c as follows to print out some variables. This section starts around line 80:

#define ufs_inode_bmap(inode, nr) \
	(SWAB32((inode)->u.ufs_i.i_u1.i_data[(nr) >> uspi->s_fpbshift]) \
	+ ((nr) & uspi->s_fpbmask))

static inline unsigned int ufs_block_bmap (struct buffer_head * bh, unsigned nr,
	struct ufs_sb_private_info * uspi, unsigned swab)
{
	unsigned int tmp, d1, d2, d3, d4;

	UFSD(("ENTER, nr %u\n", nr))
	if (!bh)
		return 0;
	tmp = SWAB32(((u32 *) bh->b_data)[nr >> uspi->s_fpbshift])
		+ (nr & uspi->s_fpbmask);

	/* debug: look up fragments 0, 8, 16 and 24 in the same buffer */
	d1 = SWAB32(((u32 *) bh->b_data)[0 >> uspi->s_fpbshift])
		+ (0 & uspi->s_fpbmask);
	d2 = SWAB32(((u32 *) bh->b_data)[8 >> uspi->s_fpbshift])
		+ (8 & uspi->s_fpbmask);
	d3 = SWAB32(((u32 *) bh->b_data)[16 >> uspi->s_fpbshift])
		+ (16 & uspi->s_fpbmask);
	d4 = SWAB32(((u32 *) bh->b_data)[24 >> uspi->s_fpbshift])
		+ (24 & uspi->s_fpbmask);

	/* debug: the buffer address is printed as a signed int */
	printk("bh->b_data = %d \n", bh->b_data);
	printk("d1=0 %d d2=8 %d d3=16 %d d4=24 %d \n", d1, d2, d3, d4);
	printk(" s_fpbshift %d _fpbmask %u \n",
		SWAB32(uspi->s_fpbshift), SWAB32(uspi->s_fpbmask));
	UFSD(("EXIT, result %u\n", tmp))
	brelse (bh);
	return tmp;
}

This is what I got trying to read oilstock.tar:

Jun 8 02:20:59 GD-DU kernel: ino 5 mode 0100644 nlink 1 uid 0 gid 15 size 91555840 blocks 0
Jun 8 02:20:59 GD-DU kernel: db <800 808 816 824 832 840 848 856 864 872 880 888>
Jun 8 02:20:59 GD-DU kernel: gen 951759770 ib <24328 48648 0>
Jun 8 02:20:59 GD-DU kernel: (inode.c, 686), ufs_read_inode: EXIT
Jun 8 02:21:10 GD-DU kernel: (inode.c, 92), ufs_block_bmap: ENTER, nr 0
Jun 8 02:21:10 GD-DU kernel: bh->b_data = -1054302208
Jun 8 02:21:10 GD-DU kernel: d1=0 24320 d2=8 24400 d3=16 24408 d4=24 24344 <- these are the first four indirect blocks for the file!
Jun 8 02:21:10 GD-DU kernel: s_fpbshift 3 _fpbmask 7
Jun 8 02:21:10 GD-DU kernel: (inode.c, 113), ufs_block_bmap: EXIT, result 24320
Jun 8 02:21:10 GD-DU kernel: (inode.c, 92), ufs_block_bmap: ENTER, nr 2040
Jun 8 02:21:10 GD-DU kernel: bh->b_data = -1054301184
Jun 8 02:21:10 GD-DU kernel: d1=0 39440 d2=8 39448 d3=16 39456 d4=24 39464
Jun 8 02:21:10 GD-DU kernel: s_fpbshift 3 _fpbmask 7
Jun 8 02:21:10 GD-DU kernel: (inode.c, 113), ufs_block_bmap: EXIT, result 41480
Jun 8 02:21:10 GD-DU kernel: TER, nr 2040
Jun 8 02:21:10 GD-DU kernel: bh->b_data = -1054301184
Jun 8 02:21:10 GD-DU kernel: d1=0 39440 d2=8 39448 d3=16 39456 d4=24 39464
Jun 8 02:21:10 GD-DU kernel: s_fpbshift 3 _fpbmask 7
Jun 8 02:21:10 GD-DU kernel: (inode.c, 113), ufs_block_bmap: EXIT, result 41480
Jun 8 02:21:10 GD-DU kernel: (inode.c, 92), ufs_block_bmap: ENTER, nr 78
Jun 8 02:21:10 GD-DU kernel: bh->b_data = -1054300160
Jun 8 02:21:10 GD-DU kernel: d1=0 1308676863 d2=8 654368511 d3=16 1090560511 d4=24 956342015
Jun 8 02:21:10 GD-DU kernel: s_fpbshift 3 _fpbmask 7
Jun 8 02:21:10 GD-DU kernel: (inode.c, 113), ufs_block_bmap: EXIT, result 352322054
Jun 8 02:21:10 GD-DU kernel: (inode.c, 92), ufs_block_bmap: ENTER, nr 79
Jun 8 02:21:10 GD-DU kernel: bh->b_data = -1054302208
Jun 8 02:21:10 GD-DU kernel: d1=0 24320 d2=8 24400 d3=16 24408 d4=24 24344
Jun 8 02:21:10 GD-DU kernel: s_fpbshift 3 _fpbmask 7
Jun 8 02:21:10 GD-DU kernel: (inode.c, 113), ufs_block_bmap: EXIT, result 24399
Jun 8 02:21:10 GD-DU
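To help read the log, here is a small Python model of the index arithmetic in ufs_block_bmap, using the s_fpbshift 3 and _fpbmask 7 values printed above. It is a sketch, not the kernel code: the function name and the list standing in for bh->b_data are illustrative, and the contiguous run of addresses in the second example is an assumption made only to check the nr 2040 line.

```python
def block_bmap(entries, nr, fpbshift=3, fpbmask=7):
    """Model of the lookup: pick the entry for nr's block, add the fragment offset.

    entries -- 32-bit addresses as they would be read from the buffer
    nr      -- fragment number relative to the start of that buffer's range
    """
    return entries[nr >> fpbshift] + (nr & fpbmask)


# First four entries as printed in the log (d1..d4) for the nr 0 call.
first = [24320, 24400, 24408, 24344]
print(block_bmap(first, 0))

# ASSUMPTION: the buffer seen for nr 2040 holds a contiguous run of
# 8-fragment blocks starting at 39440 (only d1..d4 are actually logged).
run = [39440 + 8 * i for i in range(256)]
print(block_bmap(run, 2040))
```

Under those assumptions the model reproduces the results logged for nr 0 and nr 2040, which suggests the arithmetic is sound when the buffer contents are sensible. It says nothing about why the buffer seen for nr 78 holds such different values.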

Why does bh->b_data change? Shouldn't it stay the same?

Clay



