RE: [PATCH 2/3] ext4: introduce ext4_error_remove_page

From: Luck, Tony
Date: Fri Oct 26 2012 - 18:34:18 EST


> Well, we could set a new attribute bit on the file which indicates
> that the file has been corrupted, and this could cause any attempts to
> open the file to return some error until the bit has been cleared.

That sounds a lot better than renaming/moving the file.

> This would persist across reboots. The only problem is that system
> administrators might get very confused (at least at first, when they
> first run a kernel or a distribution which has this feature enabled).

Yes. This would require some education. But new attributes have been
added in the past (e.g. immutable) that caused confusion to users and
tools that didn't know about them.

> Application programs could also get very confused when any attempt to
> open or read from a file suddenly returned some new error code (EIO,
> or should we designate a new errno code for this purpose, so there is
> a better indication of what the heck was going on?)

EIO sounds wrong ... but it is perhaps the best of the existing codes. Adding
a new one is also challenging too.

> Also, if we just log the message in dmesg, if the system administrator
> doesn't find the "this file is corrupted" bit right away

This is pretty much a given. Nobody will see the message in the console log
until it is far too late.

> I'm not sure it's worth it to go to these extents, but I could imagine
> some customers wanting to have this sort of information. Do we know
> what their "nice to have" / "must have" requirements might be?

18 years ago Intel rather famously attempted to sell users on the idea that a
rare divide error that sometimes gave the wrong answer could be ignored. Before
my time at Intel, but it is still burned into the corporate psyche that customers
really don't like to get the wrong answers from their computers.

Whether it is worth it may depend on the relative frequency of data being
corrupted this way, compared to all the other ways that it might get messed
up. If it were a thousand times more likely that data got silently corrupted
on its path to media, sitting spinning on the media, and then back off the
drive again - then all this fancy stuff wouldn't make any real difference.
I have no data on the relative error rates of memory and i/o - so I can't
answer this.

-Tony
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/