Re: more NCR53c8xx ext2 problems.

Gerard Roudier (groudier@club-internet.fr)
Wed, 21 Aug 1996 18:06:35 +0000 (GMT)


Hi Todd,

On Tue, 20 Aug 1996, Todd J Derr wrote:

> Hello,
>
> I had reported some e2fs corruption (errors like others have
> reported, with bogus d_reclen, d_namelen) a month or so ago on my
> system. Finding no resolution, I gave up and backed everything up,
> and re-mkfs'ed. The errors were gone for a few weeks, but now they
> are back.
>

Does this problem affect only directory blocks, or did you also get
corrupted blocks on other disk reads (data files, executables, etc.)?
The previous report I'd seen of this problem was a bad directory entry at
offset 5632 = 5x1024+512, with name_len = 1541 = 1024+512+5 and rec_len = 16.
As I remember, the explanations suggested in response to that report were:
- 2-bit memory error
- broken CPU
- bad scsi cables
With memory errors or a broken CPU, you should have lots of other problems
with your system. If the scsi cables were bad, then you should get scsi
transfer errors and/or scsi parity errors.
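
To make the numbers in that report concrete, here is a minimal sketch of
the kind of sanity check a directory entry has to pass. It is only an
illustration, not the kernel code: the function names, the message
strings, the 1024-byte block size and the 8-byte entry header are my
assumptions.

```python
def dir_rec_len(name_len):
    # Space an entry needs: assumed 8-byte header plus the name,
    # rounded up to a multiple of 4.
    return (name_len + 8 + 3) & ~3


def check_dir_entry(offset, rec_len, name_len, block_size=1024):
    """Return a description of the first problem found, or None."""
    if rec_len < dir_rec_len(1):
        return "rec_len is smaller than minimal"
    if rec_len % 4 != 0:
        return "rec_len is not a multiple of 4"
    if rec_len < dir_rec_len(name_len):
        return "rec_len is too small for name_len"
    if (offset % block_size) + rec_len > block_size:
        return "directory entry crosses a block boundary"
    return None


# The entry from the earlier report: offset 5632, rec_len 16, name_len 1541.
block, offset_in_block = divmod(5632, 1024)   # block 5, byte 512
problem = check_dir_entry(5632, 16, 1541)
```

Under these assumptions the reported entry sits exactly in the middle of
the sixth 1K block and fails because a 16-byte record cannot hold a
1541-byte name.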

If you only get ext2 directory errors, it might be a software problem.
I have looked into the ext2 directory code, and what I have understood
so far seems to be OK.
However, I do wonder about the possible effects of I/O reordering and
asynchronous I/O on that code.

I would be interested in all the error messages from ext2_..._dir() that
you got, and in the parameters of the ext2 partitions on which you have had
directory block problems. Some correlations may help us guess at the cause.

I do not exclude a possible problem in the scsi subsystem (including the
driver code), but that explanation seems improbable to me. The reason is
that data block corruption should affect all kinds of data, and not only
directory blocks.

> In the interim, I have seen others report the same problems
> with NCR53c8xx (mine is an 810, with a conner disk as the only
> device), so... basically I'm wondering where to go from here. Is
> there any resolution to this problem (i.e. twiddling config options?
> I've already disabled sync, disconnect, and fast mode... or twiddling
> the compile options that aren't available through 'make config'?)
>

You should disable or enable one feature at a time and then wait and see...

> I'd be happy to help out debugging this problem. I can try to
> find a reliable way to reproduce it, i'll try stealing some swap to

That _is_ the right way to go about it.

> make an fs on and playing with it. All I can say is that I spend a
> lot of time overnight writing smallish files in the same directory,
> which seems consistent with someone else's report (I saw someone
> running a news server...)
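
A workload like that could be sketched roughly as follows. This is only
a starting point under my own assumptions: the function name, the file
count and the file size are arbitrary, and the directory is expected to
live on the scratch filesystem under test.

```python
import os


def stress_directory(directory, count=2000, size=300):
    """Write `count` small files into one directory, then re-check them.

    Returns a list of problem descriptions (empty if nothing was found).
    """
    expected = {}
    for i in range(count):
        name = "f%06d" % i
        # Fill each file with a pattern derived from its own name.
        payload = (name.encode() * (size // len(name) + 1))[:size]
        with open(os.path.join(directory, name), "wb") as f:
            f.write(payload)
            f.flush()
            os.fsync(f.fileno())   # push the data out to the disk
        expected[name] = payload

    problems = []
    listed = set(os.listdir(directory))
    for name, payload in expected.items():
        if name not in listed:
            problems.append("missing entry: " + name)
            continue
        with open(os.path.join(directory, name), "rb") as f:
            if f.read() != payload:
                problems.append("bad contents: " + name)
    for name in sorted(listed - set(expected)):
        problems.append("unexpected entry: " + name)
    return problems
```

The re-check may well be served from the cache rather than from the
disk, so it is probably worth unmounting the scratch filesystem and
running e2fsck -f on it between passes.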

Good luck in your debugging.

Regards, Gerard.