Re: BUG: disk corruption 2.3.99-pre4-pre3

From: Simon Kirby (sim@stormix.com)
Date: Wed Apr 05 2000 - 19:21:24 EST


On Tue, Apr 04, 2000 at 04:19:15PM -0400, Alexander Viro wrote:

> On Tue, 4 Apr 2000, Jeff Garzik wrote:
>
> > Copying an ISO from a 2.2.15-pre17 SMP box, over a network, to a
> > 2.3.99-pre4-pre3 K6 box.
> >
> > smpbox> md5sum iso
> > smpbox> cp iso /remote/tmp
> > smpbox> md5sum iso # sum matches
> >
> > k6box> md5sum iso # sums differ
>...
>
> Hmmm.... looks like somebody moved the locking in do_exit(). WTF?
> Another problem being that shmfs calls do_mmap() without proper locking,
> but that's another story.

I assume you're talking about the task_lock in do_exit()? It looks like
Andi Kleen wrote a patch that moved it and CC'd the email to viro@redhat.com
(attached). The subject was "task_lock allows denial of service [+PATCH]".

Anyway, I just wanted to mention that I'm still seeing corruption in
2.3.99pre4pre4, just a lot less than I ever used to...it's about 10x
harder than it used to be to reproduce it.

I've attached two gzipped kernel .c files which show the corruption.
linux/drivers/scsi/ChangeLog was the file that became corrupted this
time. This corruption is from 2.3.99pre4pre4.

The corruption starts at offset 3072 which I suppose could indicate read
corruption because I was copying from a 1K ext2 filesystem to a 2K ext2
filesystem.
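
For anyone who wants to check where a bad copy first diverges, here is a
minimal sketch in Python. The function names first_diff_offset and
block_alignment are made up for illustration, and the 1K/2K block sizes are
just the two filesystems mentioned above. It reports the first differing byte
offset and whether that offset sits on a block boundary for each size.

```python
def first_diff_offset(path_a, path_b, chunk_size=65536):
    """Return the byte offset of the first difference, or None if identical."""
    offset = 0
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        while True:
            a = fa.read(chunk_size)
            b = fb.read(chunk_size)
            if a != b:
                # Find the exact byte inside this chunk.
                for i, (x, y) in enumerate(zip(a, b)):
                    if x != y:
                        return offset + i
                # One file is a prefix of the other.
                return offset + min(len(a), len(b))
            if not a:
                return None  # both files ended with no difference
            offset += len(a)


def block_alignment(offset, block_sizes=(1024, 2048)):
    """Map each block size to True if offset falls on a block boundary."""
    return {bs: offset % bs == 0 for bs in block_sizes}
```

An offset of 3072 is a multiple of 1024 but not of 2048, so it lines up with a
block boundary on the 1K source filesystem and falls mid-block on the 2K
destination. That is consistent with the read side being at fault, though it
does not prove it.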

I was able to reproduce read corruption by doing nothing but reading the
same xfree source tree over and over again (via find...xargs md5sum), but
only twice out of hundreds of tries. The corruption vanished from cache
before I could look at either case.
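
A rough Python equivalent of that read-only loop is sketched below. It is not
the command I actually ran. The names md5_of_file, checksum_tree and
find_unstable_files are hypothetical. It checksums every file under a
directory once as a baseline, then re-reads the tree a few times and records
any file whose checksum changes.

```python
import hashlib
import os


def md5_of_file(path, chunk_size=65536):
    """Return the hex MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def checksum_tree(root):
    """Map each regular file's path (relative to root) to its MD5 digest."""
    sums = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.isfile(path) and not os.path.islink(path):
                sums[os.path.relpath(path, root)] = md5_of_file(path)
    return sums


def find_unstable_files(root, passes=3):
    """Re-read the tree `passes` times; list files whose digest changed."""
    baseline = checksum_tree(root)
    mismatches = []
    for pass_no in range(1, passes + 1):
        current = checksum_tree(root)
        for rel, expected in baseline.items():
            got = current.get(rel)
            if got != expected:
                mismatches.append((pass_no, rel, expected, got))
    return mismatches
```

Two caveats: the baseline pass could itself be the corrupted read, so a
mismatch only says that two reads disagreed, not which one was wrong. Also, if
the tree fits in memory, later passes may be served from cache rather than
from the disk.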

Again, I'm on an SMP box (2 processors) with 128 MB ECC SDRAM and three
drives (hda, hdb, hdc), copying from hdc2 (1K) to hdb2 (2K). Booting
with "nosmp" makes the problem disappear. I also saw the corruption
simply reading from hda, so the cross-major/device thing might not be
related. However, I saw that back in the kernel versions when I first
reported the problem and it was easier to reproduce, so it may have been
a different problem that has since been fixed (or it could be the same
problem).

Disabling DMA mode to the drive with "hdparm -d 0" (using PIO instead)
seems to help reproduce the problem...I can't even seem to reproduce it
at all anymore with DMA enabled, but it might still happen if I left it
copying for hours.

Simon-

[ Stormix Technologies Inc. ][ NetNation Communications Inc. ]
[ sim@stormix.com ][ sim@netnation.com ]
[ Opinions expressed are not necessarily those of my employers. ]









