Re: more NCR53c8xx ext2 problems.

Todd J Derr (infidel+@pitt.edu)
Wed, 21 Aug 1996 14:15:58 -0400 (EDT)


Gerard Roudier <groudier@club-internet.fr> writes:
> Does this problem only affect directory blocks or did you get some
> corrupted blocks for other disk read IO? (data files, exec files, etc...).

as far as I can tell, it's only directory errors, always on the same
directory. The access pattern of the application is this: every
night, between 1am and 7am, there are many (15-90) processes running
which continually open a new file, write some data, and close it, for
a total of 15000-20000 files per night. The errors are always on this
directory ('debug').
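
For anyone who wants to try reproducing this, the access pattern above can be approximated with a small script. This is only a sketch, not the real application: the worker count, file contents, and the 'Dhost...example.org' names are made up, and `run_workload` is a hypothetical helper. Point it at a scratch directory on the partition under test.

```python
import os
import subprocess
import sys

# Code run by each worker process: open a new file, write some data,
# close it, and repeat. Names are hypothetical stand-ins shaped like
# 'Dsome.internet.hostname.00PID'.
WORKER = r"""
import os, sys
directory, count = sys.argv[1], int(sys.argv[2])
pid = os.getpid()
for i in range(count):
    name = "Dhost%d.example.org.%05d" % (i, pid)
    with open(os.path.join(directory, name), "w") as f:
        f.write("debug data\n")
"""

def run_workload(directory, workers=15, files_per_worker=100):
    """Start `workers` processes that all create files in `directory`.

    Returns the number of entries in the directory afterwards.
    """
    procs = [
        subprocess.Popen(
            [sys.executable, "-c", WORKER, directory, str(files_per_worker)]
        )
        for _ in range(workers)
    ]
    for p in procs:
        if p.wait() != 0:
            raise RuntimeError("worker exited with an error")
    return len(os.listdir(directory))

if __name__ == "__main__":
    # Usage: python workload.py /path/to/scratch/debug
    print(run_workload(sys.argv[1]))
```

Something like 15-90 workers with enough files each to reach 15000-20000 in total would be closest to the nightly load described above.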

> The previous report I'd seen of this problem was a bad directory entry at
> offset 5632 = 5x1024+512, name_len = 1541=1024+512+5, rec_len=16.

the messages I get vary; they look like some sort of multiple-bit
errors. It is hard to say since I don't know what the filename in
question is, but I do know that the filenames used are typically ~20-30
bytes long (valid range is probably 12-263 bytes, though). The files
are called 'Dsome.internet.hostname.00PID'.
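
To make the quoted numbers concrete, here is a rough sketch of the kind of sanity check a directory entry has to pass. It assumes the on-disk entry header is a little-endian 32-bit inode, 16-bit rec_len, and 16-bit name_len, followed by the name; the message strings are only modelled on the kernel's, and `check_dir_entry` is a hypothetical helper, not the kernel code.

```python
import struct

# Assumed entry header layout: inode (u32), rec_len (u16), name_len (u16),
# all little-endian, followed by name_len bytes of name.
HEADER = struct.Struct("<IHH")

def min_rec_len(name_len):
    """Smallest record that holds the header plus name, rounded up to 4."""
    return (name_len + HEADER.size + 3) & ~3

def check_dir_entry(block, offset):
    """Return a description of the first problem found, or None if sane."""
    inode, rec_len, name_len = HEADER.unpack_from(block, offset)
    if rec_len < min_rec_len(1):
        return "rec_len is smaller than minimal"
    if rec_len % 4 != 0:
        return "rec_len % 4 != 0"
    if rec_len < min_rec_len(name_len):
        return "rec_len is too small for name_len"
    if offset + rec_len > len(block):
        return "directory entry across blocks"
    return None

if __name__ == "__main__":
    # The entry from the earlier report: offset 5632 is byte 512 of a
    # 1024-byte block, with name_len=1541 and rec_len=16. The inode
    # number 12 is arbitrary.
    block = bytearray(1024)
    HEADER.pack_into(block, 512, 12, 16, 1541)
    print(check_dir_entry(block, 512))
```

With rec_len=16 the record can hold a name of at most 8 bytes, so a name_len of 1541 cannot be right, which is why an entry like that gets rejected.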

> With memory errors or a broken CPU, you should have lots of other problems
> with your system. If the scsi cables were bad, then you should get scsi
> transfer errors, and/or scsi parity errors.

no, I see no obvious signs of any malfunction other than the errors in
the log. The machine has been up 24x7, aside from an occasional reboot,
for the past 5 months or so, and gets a lot of activity (web and mailing
lists).

> If you only get ext2 directory errors, it might be a software problem.
> I have looked into the ext2 fs directory code. For the moment what I
> have understood seems to be ok.

i at first suspected it might be an ext2fs problem, but since the only
people I've seen reporting the problem are using ncr53c8xx drivers, it
really looks more like that could be the culprit... but, again, it is
odd that I see no other malfunctions at all, which I would expect if
it were a driver problem.

> I was interested in all the errors messages from ext2_..._dir() you got
> and in the parameters of the ext2 partitions on which you have had dir block
> problems. Perhaps some correlations may help us to guess something.

> You should disable or enable one feature at a time and then wait and see...

I disabled FAST mode last night and did not see the errors for the
first time all week. However, I also obviously had to fsck the
partition (which found no errors) and reboot the machine, so it's too
early to tell...

if you want to look at my log of errors from the past week, you can
get it from ftp://wordsmith.org/pub/e2fs.errors.gz. Beware that it's
really 6.72MB once unzipped. (ah, if i could only get 157:1
compression on every file :)
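
If you would rather not unpack the whole thing to disk, the log can be read straight from the .gz file. A minimal sketch, assuming the file has already been fetched to a local path (the filename in the usage line is just the one from the URL above, and `summarize_gzip_log` is a hypothetical helper):

```python
import gzip
import sys

def summarize_gzip_log(path):
    """Count lines and uncompressed bytes without writing anything to disk."""
    lines = 0
    size = 0
    with gzip.open(path, "rb") as f:
        for line in f:
            lines += 1
            size += len(line)
    return lines, size

if __name__ == "__main__":
    # Usage: python summarize.py e2fs.errors.gz
    n_lines, n_bytes = summarize_gzip_log(sys.argv[1])
    print(n_lines, "lines,", n_bytes, "bytes uncompressed")
```

At 157:1 the download itself should only be on the order of 43KB.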

todd.