Re: [PATCH 00/22] HWPOISON: Intro (v5)

From: Andi Kleen
Date: Mon Jun 15 2009 - 11:16:25 EST


> Everyone I knew in the business end of deploying Linux turned on panics
> for I/O errors, reboot on panic and all the rest of those.

oops=panic already implies a panic on every machine check exception, so they
will be fine then (assuming that is really the best strategy for availability
for them, which I personally find quite doubtful, but we can discuss that some
other time).

> Really - so if your design is wrong for the way PPC wants to work what
> are we going to do ? It's not a requirement that PPC64 support is there

Then we change the code. Or if it's too difficult don't support their stuff.
After all it's not cast in stone. That said I doubt the PPC requirements will
be much different than what we have.

> I'd guess that zSeries has some rather different views on how ECC
> failures propagate through the hypervisors for example, including the
> fact that a failed page can be unfailed which you don't seem to allow for.

That's correct.

That's because unpoisoning is quite hard -- you need some kind
of synchronization point for all the error handling, and that is
the poisoned page. If it unpoisons itself then you need
some very heavyweight synchronization to avoid handling errors
multiple times. I looked at it, but it's quite messy.

Also it's of somewhat dubious value.
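
To illustrate the synchronization point described above, here is a minimal
user-space sketch, assuming a one-way poison flag that is set atomically and
never cleared. It is not the patchset's code: struct page_sketch,
handle_memory_error() and the handled counter are hypothetical names made up
for this example, and C11 atomics stand in for the kernel's page flag
operations.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a page descriptor; not the kernel's struct page. */
struct page_sketch {
    atomic_bool hwpoison;   /* one-way flag: set once, never cleared */
    int handled;            /* how many times recovery actually ran */
};

/*
 * Report a memory error on a page.
 * Returns true if this caller performed the recovery, false if the page
 * was already poisoned and another reporter had handled it.
 */
static bool handle_memory_error(struct page_sketch *p)
{
    assert(p != NULL);

    /*
     * atomic_exchange() returns the previous value, so exactly one
     * caller observes "false" and becomes responsible for recovery.
     */
    if (atomic_exchange(&p->hwpoison, true))
        return false;

    /* Only modified by the single caller that won the exchange. */
    p->handled++;
    return true;
}
```

Calling handle_memory_error() twice on the same page runs the recovery once;
the second call returns false. If something could clear hwpoison again, a late
report of the same error could find the flag clear and run the recovery a
second time, which is why unpoisoning would need extra synchronization on top
of this.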

>
> (You can unfail pages on x86 as well it appears by scrubbing them via DMA
> - yes ?)

Not architecturally. Also the other problem is not just unpoisoning them,
but finding out if the page is permanently bad or only temporarily.

-Andi
--
ak@xxxxxxxxxxxxxxx -- Speaking for myself only.