BUGHUNTING: sata_sil, Silicon Image 3114 controllers, 2.6.18 and numerous errors

From: Jonathan Bell
Date: Tue Oct 03 2006 - 15:56:13 EST


Hello,

This is a long story, so I'll try to keep it concise.

I am running the following system (a large file server):

Athlon XP 2600+ CPU
1GB PC2700 RAM
A7N8X (nForce2) motherboard, BIOS 1008 (latest)
2x Silicon Image SiI3114 SATA controllers, output of lspci in [1]lspci.gz
D-Link DG530-T Gigabit PCI network card
Twinhan VisionPlus VisionDTV TV card

Drives =

2xWD3000JD 300GB Western Digital
2xWD3200JD 320GB Western Digital

3x6L300S0 300GB Maxtor
3x6B300S0 300GB Maxtor (identical to above but slightly older firmware and non-RoHS compliant)

The mess of problems began with _all_ these drives whilst using kernel 2.6.15-26-k7 in the Ubuntu Dapper distribution. I got hit with a bug concerning FUA:

http://groups.google.com/group/fa.linux.kernel/browse_thread/thread/b0c495e4cf9d6d2/6ac40ae91be51b23?lnk=st&q=libpata+code+issues&rnum=1#6ac40ae91be51b23

At this stage my testing methodology was the following:

1) Make filesystems (reiserfs) on each of the drives
2) Make a huge file (11GB) by piping the output of /bin/yes "0123456789" to a file
3) Sync the disks
4) Calculate MD5 checksum of the file
5) Copy hugefile across to the next drive
6) Calculate MD5sum on new file
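
As a rough sketch only (not the exact commands used), steps 2 to 6 above could be scripted in Python along these lines. The function names and paths are hypothetical, the pattern assumes /bin/yes emits the string followed by a newline, and os.fsync on the one file stands in for a system-wide sync:

```python
import hashlib
import os
import shutil

# Assumption: /bin/yes "0123456789" repeats this line forever.
PATTERN = b"0123456789\n"


def write_pattern_file(path, total_bytes):
    """Write at least total_bytes of the repeated pattern, then flush to disk."""
    chunk = PATTERN * 4096
    written = 0
    with open(path, "wb") as f:
        while written < total_bytes:
            f.write(chunk)
            written += len(chunk)
        f.flush()
        os.fsync(f.fileno())  # per-file stand-in for step 3 (sync)


def md5_of(path, block_size=1 << 20):
    """Return the hex MD5 digest of a file, read in 1 MiB blocks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()


def copy_and_verify(src, dst):
    """Copy src to dst; return (src_md5, dst_md5, whether they match)."""
    src_md5 = md5_of(src)
    shutil.copyfile(src, dst)
    dst_md5 = md5_of(dst)
    return src_md5, dst_md5, src_md5 == dst_md5
```

A call such as copy_and_verify("/mnt/sda1/hugefile", "/mnt/sdb1/hugefile") (mount points hypothetical) would cover steps 4 to 6. The test file needs to be much larger than RAM, as the 11GB file is here; otherwise the read-back may be served from the page cache rather than from the destination disk.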

Writing the hugefile to EACH drive filled the kernel log with the errors seen throughout the thread linked above, about one every 20 seconds, and the MD5 sums of the copied files did not match.

Noting that the problem occurred on both the Maxtor drives (with much more severity, leading to device resets) and the Western Digitals, I manually installed the 2.6.18 kernel from kernel.org, which disables FUA by default. This made the errors "disappear" on the Maxtor drives, but the Western Digital drives still showed the same errors as before.

Because I am naive, I went out and purchased another four drives, Seagates this time. I replaced the Western Digitals in the machine with the following model: 4xST3250824NS (250GB) and started testing again. This time I get errors like:

[ 1876.112335] attempt to access beyond end of device
[ 1876.112429] sdc1: rw=0, want=4517265416, limit=488392002
[ 1876.112516] attempt to access beyond end of device
[ 1876.112588] sdc1: rw=0, want=2110783496, limit=488392002
[ 1876.112723] attempt to access beyond end of device
[ 1876.112793] sdc1: rw=0, want=4656529416, limit=488392002
[ 1876.122250] attempt to access beyond end of device
[ 1876.122339] sdc1: rw=0, want=4517265416, limit=488392002

and even a kernel Oops: [2]reiseroops.gz

These errors occurred after I copied the hugefile to a destination drive, during the calculation of the md5sum, i.e. upon reading the new file.
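
To make the size of the overshoot in the "access beyond end of device" lines easier to see, here is a small Python sketch that parses them. The helper name is hypothetical, and it assumes that want and limit are counted in 512-byte sectors:

```python
import re

SECTOR_BYTES = 512  # assumption: want/limit are in 512-byte sectors

LINE_RE = re.compile(r"(\w+): rw=(\d+), want=(\d+), limit=(\d+)")


def parse_beyond_end(log_text):
    """Return one dict per 'want=/limit=' line found in log_text."""
    results = []
    for m in LINE_RE.finditer(log_text):
        want = int(m.group(3))
        limit = int(m.group(4))
        results.append({
            "dev": m.group(1),
            "rw": int(m.group(2)),
            "want": want,
            "limit": limit,
            "overshoot_sectors": want - limit,
            "ratio": want / limit,
        })
    return results


sample = """\
[ 1876.112429] sdc1: rw=0, want=4517265416, limit=488392002
[ 1876.112588] sdc1: rw=0, want=2110783496, limit=488392002
[ 1876.112793] sdc1: rw=0, want=4656529416, limit=488392002
"""

for entry in parse_beyond_end(sample):
    print(entry["dev"], entry["want"], entry["overshoot_sectors"],
          round(entry["ratio"], 1))

# Partition size in bytes implied by the limit, under the sector assumption.
print(488392002 * SECTOR_BYTES)
```

Under that assumption the limit corresponds to a partition of roughly 250GB, which matches the Seagate drives, while every requested sector lies several times past the end of it. That looks more like corrupted block addresses than a read that runs slightly over the end.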

Digging deeper, I am now at the stage where I can report the following:

The errors are independent of the controller hardware. I tried one brand-new controller and one used controller that was verified working (on Windows), and the same errors happened in each case.
The errors are independent of type of drive - Seagates and Maxtors both exhibit the same errors.
The errors are independent of filesystem - I tried and got the same result with ext2, ext3 and reiserfs 3.6.
The errors ONLY occur when reading from newly-created files. I am currently running badblocks -n on the drives, which will obviously take some time on drives this large, to find out whether this also happens with simple block reads/writes.

Also I can say this:

The hugefile copied from the FIRST drive on one controller to the FIRST drive on the other controller exhibited NO ERRORS in either direction. By this I mean a Seagate attached to port0 and a Maxtor attached to port0. [3]dmesg-detection.gz may help with this - it is the kernel's detection of the drives. Whether this is due to dumb luck or a quirk in this "bug", I don't know, but I will keep trying to make this error happen on the first drives in the system.

Copies of the hugefile from the FIRST Seagate drive to the SECOND, THIRD and FOURTH Seagate drives all make md5sum report "input/output error", with the associated "access beyond end of device" errors in dmesg. The same thing happens when I copy from the FIRST Maxtor drive to the second and third (not the fourth, as it contains NTFS data).

I will keep copying back and forth between drives in an effort to map out what is causing the error, but I'm going to need some pointers to track this to the source.

Any help appreciated,
Jonathan
