Re: Laptop mode causing writes to wrong sectors?

From: Bart Samwel
Date: Thu Nov 17 2005 - 04:27:36 EST


Jan Niehusmann wrote:
On Wed, Nov 16, 2005 at 09:06:03PM +0100, Bart Samwel wrote:
First of all, you having resized your fs is a smoking gun, if you ask me. Your fs is dead/dying, and you know you've recently been tinkering with it. It's the most probable cause.

Well, it would be nice if the explanation was that easy - but the recent
corruption (with the man page) was on my root partition, which is not on
LVM and has never been resized.

Youch. I assumed this was all the same fs! It is the same HD though?

Be assured that at least after the first corruption I observed, I did
forced e2fsck on all partitions. Without any errors found.

ACK.

after the resize2fs -- seeing as all the subsequent fscks were probably done by journal.

What do you mean with 'by journal'? The filesystems were unmounted (or
remounted r/o), so the journal should have been committed and empty at
e2fsck time.

I mean, no full e2fsck, at most a journal replay. This doesn't check the bitmaps, so if they are inconsistent, they stay inconsistent.

But now, I got another hint pointing to a possible cause of this
problem: I found a file - /usr/lib/libatlas.so.3.0 - which was corrupted
by 4k of it being overwritten by a different file, which I recognized. And that file happened to be an uncompressed manual page.
This sounds like your filesystem's block bitmaps are "fscked up". These problems can definitely cause "creeping corruption" when undetected,

But this should definitely have been detected by an fsck, right?

Yes. And you've had this problem before, even. Googling for "e2fsck block bitmap differences" shows me this as the third entry. :)

http://lkml.org/lkml/2003/8/3/166

If you didn't get those messages, then this is not the problem, apparently.

I agree it's probably not a sync problem. And therefore, probably not
really a laptop-mode bug, even if laptop-mode triggered the corruption.
I suspected the hard drive to mess up write requests during spin-up. Or
perhaps giving some kind of error message, which could trigger a bug in
a rarely tested error-handling path in the kernel. But the fact that you
never got similar reports makes this less likely. In the end, I have to
consider there may be some bad hardware in my laptop. (Already did a
memtest86, of course.)

There is a known problem with laptop mode where, often during spindown/spinup, the kernel emits DMA timeout errors.

http://lkml.org/lkml/2005/8/21/48

According to Andrea Gelmini (the original reporter of this problem) this can lead to system freezes on some kernels, and corruption on others. Are you seeing these errors somewhere in your logs?

What's your hardware? A Thinkpad perhaps?

--Bart
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/