Re: [PATCH 2/3] ext4: introduce ext4_error_remove_page

From: Theodore Ts'o
Date: Fri Oct 26 2012 - 14:46:49 EST


On Fri, Oct 26, 2012 at 04:55:01PM +0000, Luck, Tony wrote:
>
> I think that we know that the file *is* corrupted, not just "potentially".
> We probably know the location of the corruption to cache-line granularity.
> Perhaps better on systems where we have access to ecc syndrome bits,
> perhaps worse ... we do have some errors where the low bits of the address
> are not known.

Well, it's at least *possible* that it was only the ECC bits that got
flipped. :-) Not likely, I'll grant! (Or does the motherboard zero
out the entire cache-line on a hard ECC failure?)

> I'm in total agreement that forcing a reboot or fsck is unhelpful here.
>
> But what should we do? We don't want to let the error be propagated. That
> could cause a cascade of more failures as applications make bad decisions
> based on the corrupted data.
>
> Perhaps we could ask the filesystem to move the file to a top-level
> "corrupted" directory (analogous to "lost+found") with some attached
> metadata to help recovery tools know where the file came from, and the
> range of corrupted bytes in the file? We'd also need to invalidate existing
> open file descriptors (or less damaging - flag them to avoid the corrupted
> area??). Whatever we do, it needs to be persistent across a reboot ... the
> lost bits are not going to magically heal themselves.

Well, we could set a new attribute bit on the file which indicates
that the file has been corrupted, and this could cause any attempt to
open the file to return an error until the bit has been cleared.
This would persist across reboots. The only problem is that system
administrators might get very confused, at least when they first run
a kernel or a distribution which has this feature enabled.
Application programs could also get very confused when any attempt to
open or read from a file suddenly returned an unexpected error code
(EIO, or should we designate a new errno code for this purpose, so
there is a better indication of what the heck was going on?)
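
To make the attribute-bit idea concrete, here is a rough user-space
model of the check, not actual ext4 or VFS code. Everything named
demo_* or DEMO_* is hypothetical, and -EIO is used only because it is
one of the two options floated above.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical flag bit; not an existing ext4 or VFS inode flag. */
#define DEMO_CORRUPTED_FL 0x1u

/* Toy stand-in for the persistent per-file flags word. */
struct demo_inode {
    uint32_t flags;
};

/* Would be called when a memory error hits one of the file's pages. */
void demo_mark_corrupted(struct demo_inode *inode)
{
    assert(inode != NULL);
    inode->flags |= DEMO_CORRUPTED_FL;
}

/* Would be called once an administrator has repaired or restored the file. */
void demo_clear_corrupted(struct demo_inode *inode)
{
    assert(inode != NULL);
    inode->flags &= ~DEMO_CORRUPTED_FL;
}

/* Open-time check: 0 if the open may proceed, -EIO while the bit is set. */
int demo_open_check(const struct demo_inode *inode)
{
    assert(inode != NULL);
    return (inode->flags & DEMO_CORRUPTED_FL) ? -EIO : 0;
}
```

Because the bit would live in the on-disk inode flags, the open keeps
failing after a reboot until something explicitly clears it.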

Also, if we only log the message in dmesg and the system administrator
doesn't notice the "this file is corrupted" bit right away, they might
not be able to determine which part of the file was corrupted. How
important is this? If the file system supports extended attributes,
should we attempt to attach a new extended attribute with information
about the ECC failure?
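
As a sketch of what such an extended attribute might carry, the
user-space code below formats a corrupted byte range as a text value
and attaches it with setxattr(2). The attribute name
"user.demo_corrupt_range", the "offset=... length=..." layout, and all
demo_* helpers are assumptions made up for illustration; nothing like
this exists in ext4 today.

```c
#include <assert.h>
#include <inttypes.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/xattr.h>

/* Hypothetical attribute name; not defined by any kernel or file system. */
#define DEMO_XATTR_NAME "user.demo_corrupt_range"

/* Format the range as text. Returns the string length, or -1 if buf is too small. */
int demo_format_range(char *buf, size_t len, uint64_t offset, uint64_t length)
{
    assert(buf != NULL);
    int n = snprintf(buf, len, "offset=%" PRIu64 " length=%" PRIu64,
                     offset, length);
    if (n < 0 || (size_t)n >= len)
        return -1;
    return n;
}

/* Parse a NUL-terminated value produced by demo_format_range(). 0 on success, -1 on error. */
int demo_parse_range(const char *s, uint64_t *offset, uint64_t *length)
{
    assert(s != NULL && offset != NULL && length != NULL);
    if (sscanf(s, "offset=%" SCNu64 " length=%" SCNu64, offset, length) != 2)
        return -1;
    return 0;
}

/* Attach the range to a file. Returns 0 on success, -1 on failure (errno set by setxattr). */
int demo_tag_file(const char *path, uint64_t offset, uint64_t length)
{
    char value[64];
    int n = demo_format_range(value, sizeof(value), offset, length);
    if (n < 0)
        return -1;
    /* xattr values are byte strings; the trailing NUL is not stored. */
    return setxattr(path, DEMO_XATTR_NAME, value, (size_t)n, 0);
}
```

A recovery tool could read the value back with getxattr(2), add a
terminating NUL, and hand it to demo_parse_range() to learn which
bytes to distrust, even after the dmesg line is long gone.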

I'm not sure it's worth it to go to these lengths, but I could imagine
some customers wanting to have this sort of information. Do we know
what their "nice to have" / "must have" requirements might be?

- Ted