Re: SMP 2.1.90-pre3 SCSI kernel panic

Doug Ledford (dledford@dialnet.net)
Mon, 16 Mar 1998 06:52:26 -0600 (CST)


On Mon, 16 Mar 1998 sistema@readysoft.es wrote:

> On 16 Mar, Doug Ledford wrote:
>
> >> 2.0.33 UP kernel works flawlessly. 2.0.33 SMP locks hard randomly, even
> >> with a BusLogic Flashpoint card instead of the Adaptec one.
>
> How can we explain that I have no problems with a 2.0.33 UP kernel +
> aic-5.0.7 patch? The machine does exactly the same work, but with no
> errors.

Wrong. The machine holds the same user data without errors, which is not
the same as writing the same disk blocks. Changes in the actual filesystem
format or standard block allocation strategies in 2.1.x could cause these
kinds of things. I don't have 2.1.x, but I'll point to some recent posts
from Ted concerning changes to the ext2fs format. For the most part, they
should be benign, but even a small change somewhere can cause a block to
be written as part of the ext2fs filesystem under 2.1.x that wouldn't get
written under 2.0.x.

> Even fscking, the machine locked up with >=2.1.89, but it completed
> right with 2.0.33 UP.

fsck doesn't attempt to read every sector on a disk unless you give it the
-c option.
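
To illustrate what "read every sector" means, here is a minimal sketch of
a read-only surface scan. It is not what fsck -c actually runs, just the
same idea in a few lines. The name scan_readable and the 4096-byte block
size are my own assumptions, and reads that go through the page cache may
never reach the drive, so treat it as an illustration only.

```python
import os

def scan_readable(path, block_size=4096):
    """Read `path` block by block; return (blocks_checked, bad_block_numbers).

    `path` may be a regular file or (on Linux, with enough privilege) a
    block device. Hypothetical helper for illustration only.
    """
    bad = []
    blocks = 0
    fd = os.open(path, os.O_RDONLY)
    try:
        # Seeking to the end gives the size of the file or device in bytes.
        size = os.lseek(fd, 0, os.SEEK_END)
        for offset in range(0, size, block_size):
            try:
                os.pread(fd, block_size, offset)
            except OSError:
                # The kernel reported an I/O error for this block.
                bad.append(offset // block_size)
            blocks += 1
    finally:
        os.close(fd)
    return blocks, bad
```

A plain fsck only walks the filesystem metadata, so a bad block that no
metadata structure lives in would go unnoticed without a pass like this.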

> I even filled the filesystem completely and
> the system kept up and running: all disk sectors full of data.

Filling a file system != all disk sectors full of data. It's relatively
easy to have a full filesystem and still have free blocks on the disk. If
you run out of inodes, the filesystem is full as far as the system is
concerned, but you can still have free data blocks lying around that you
can't use because you don't have a free inode to attach them to. Or, if
you run out of blocks but still have tons of free inodes, those free
inodes essentially count as unused space.
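
The two ways of being "full" can be told apart from the free block and
free inode counts the filesystem reports. A small sketch, assuming a Unix
system where os.statvfs is available; the helper names fs_headroom and
why_full are made up for this example.

```python
import os

def fs_headroom(path="/"):
    """Return block and inode counts for the filesystem holding `path`."""
    st = os.statvfs(path)
    return {
        "free_blocks": st.f_bfree,    # free data blocks
        "total_blocks": st.f_blocks,
        "free_inodes": st.f_ffree,    # free inodes
        "total_inodes": st.f_files,
    }

def why_full(free_blocks, free_inodes):
    """Say which resource ran out first (hypothetical helper)."""
    if free_blocks == 0 and free_inodes == 0:
        return "no free blocks and no free inodes"
    if free_inodes == 0:
        return "out of inodes, data blocks still free"
    if free_blocks == 0:
        return "out of data blocks, inodes still free"
    return "not full"

if __name__ == "__main__":
    info = fs_headroom("/")
    print(info, why_full(info["free_blocks"], info["free_inodes"]))
```

In either of the two middle cases, the filesystem refuses new files while
part of the disk has never been written with data.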

In any event, these are a couple of things to keep in mind when searching for
this. Now, as to why you see it on multiple partitions/locations on the
drive, I can't answer that. I would tend to agree with you that something
seems wrong, but I don't have a 2.1.x system to check it out and I don't
see anything obviously wrong. The most interesting part is that if this
*is* a kernel error somewhere, then it isn't a typical wild pointer or
something like that. It's something that causes a drive to give a
peripheral device write fault when nowhere near the end of the device.
It's hard to make a drive report a device fault when there isn't one.
Typically, any
error that could cause this would have also trashed your filesystem beyond
repair long before now.

BTW, in regard to the AWRE and ARRE bits: the reason I always recommend
a low level format after enabling these bits (or after any hard sector
errors, for that matter) is that, in my experience, several makes/models
of drives will fail to remap a bad sector once you have found it. If the
drive can catch the bad sector before you see it, then it will take care
of it (usually this means during a write). But if the drive doesn't find
it until after you've stored data in a sector it thought was good, many
drives fail to map out the bad sector and move the data when you try to
read it, regardless of the ARRE bit.
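
For reference, AWRE and ARRE are bits 7 and 6 of byte 2 in the SCSI
Read-Write Error Recovery mode page (page code 01h). The sketch below
only decodes a page you already have in hand; fetching it from a drive
takes a MODE SENSE command, which is not shown. The function name and
the sample bytes are made up for illustration.

```python
AWRE = 0x80  # bit 7 of byte 2: automatic write reallocation enabled
ARRE = 0x40  # bit 6 of byte 2: automatic read reallocation enabled

def decode_recovery_flags(page):
    """Decode AWRE/ARRE from raw mode page 01h bytes.

    `page` must start at the page code byte. Hypothetical helper.
    """
    # The low six bits of byte 0 carry the page code.
    if len(page) < 3 or (page[0] & 0x3F) != 0x01:
        raise ValueError("not a Read-Write Error Recovery mode page")
    flags = page[2]
    return {"AWRE": bool(flags & AWRE), "ARRE": bool(flags & ARRE)}

if __name__ == "__main__":
    # Made-up sample: page code 01h, page length 0Ah, both bits set.
    sample = bytes([0x01, 0x0A, 0xC0])
    print(decode_recovery_flags(sample))
```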
