Re: Mild filesystem corruption on ext4 (no journal)

From: Theodore Tso
Date: Fri Jun 05 2009 - 14:01:44 EST


On Fri, Jun 05, 2009 at 05:40:33PM +0300, Aioanei Rares wrote:
>> When I upgrade libc from 2.7 (debian stable) to 2.9 (debian unstable),
>> the locale breaks every reboot, and I have to repair it by running
>> locale-gen. This happened now when I only upgraded libc, in order to
>> play with signalfd(). It also happened before, when I upgraded the
>> entire machine to debian unstable (which I later reverted).
>>
>> The problem is that /usr/lib/locale/locale-archive gets corrupted when
>> I reboot. The exact corruption differs with each reboot (i.e. the
>> md5sum differs). Last time, the first ~70K was overwritten with data
>> from xorg.log and my web browsing history. I have copies of the
>> original and corrupted state which I can send, the full file is 1.3
>> megs, but I can limit it to the first 70K, since that's all that was
>> corrupted.

> I suspect, although I might be wrong, that this is not a kernel-related
> problem.

Actually, I suspect it is indeed a kernel-related problem. The
problem has been reported before, with a repeatable test case:

http://bugzilla.kernel.org/show_bug.cgi?id=13292

The problem shows up after you unmount and remount the filesystem.
Before the filesystem is unmounted, the locale-archive file has
the correct md5sum. After you unmount and remount the filesystem, the
file is corrupted. I'm guessing that some data blocks aren't
getting marked as needing writeback, so the new contents never reach
the disk and the stale on-disk contents are left in place. I was able
to show that even though the mounted filesystem had the correct
information, direct access to the disk using debugfs showed that the
blocks on disk already had the contents that would be revealed after
the filesystem was unmounted and remounted.
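
For anyone who wants to narrow down which blocks are affected, the
before/after comparison can be sketched roughly as follows. This is
only an illustration in Python: the helper names md5_of() and
differing_blocks() are made up here, and the 4096-byte block size is
an assumption that should be changed to match the filesystem. It
compares a known-good copy of the file (saved on a different
filesystem before the unmount) against the file as it appears after
the remount.

```python
import hashlib

BLOCK_SIZE = 4096  # assumed block size; adjust to match the filesystem


def md5_of(path):
    """Return the hex MD5 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def differing_blocks(good_path, bad_path, block_size=BLOCK_SIZE):
    """Return the indices of the blocks that differ between two files."""
    bad_blocks = []
    with open(good_path, "rb") as good, open(bad_path, "rb") as bad:
        index = 0
        while True:
            a = good.read(block_size)
            b = bad.read(block_size)
            if not a and not b:
                break  # both files exhausted
            if a != b:
                bad_blocks.append(index)
            index += 1
    return bad_blocks
```

If md5_of() returns different digests for the two copies, then
differing_blocks() lists the logical block numbers to look at more
closely, for example with debugfs.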

The problem only shows up when using ext4 without a journal, and I was
never able to create a simpler reproduction case. The last time I
tried to work on this bug was approximately a month ago. About two
weeks ago Frank from Google tried reproducing it, but he wasn't able
to do so using his 2.6.26-based kernel plus an updated ext4.
Unfortunately, I haven't had time to look at it since then, or to
check to see if some of the more recent patches scheduled for the
2.6.31 merge window might have changed the behaviour of this bug.

- Ted




