Re: [PATCH] [13/16] HWPOISON: The high level memory error handler in the VM v3

From: Wu Fengguang
Date: Tue Jun 02 2009 - 09:54:24 EST


On Tue, Jun 02, 2009 at 08:19:40PM +0800, Nick Piggin wrote:
> On Tue, Jun 02, 2009 at 07:14:07PM +0800, Wu Fengguang wrote:
> > On Mon, Jun 01, 2009 at 10:40:51PM +0800, Nick Piggin wrote:
> > > But you just said that you try to intercept the IO. So the underlying
> > > data is not necessarily corrupt. And even if it was then what if it
> > > was reinitialized to something else in the meantime (such as filesystem
> > > metadata blocks?) You'd just be introducing worse possibilities for
> > > corruption.
> >
> > The IO interception will be based on PFN instead of file offset, so it
> > won't affect innocent pages such as your example of reinitialized data.
>
> OK, if you could intercept the IO so it never happens at all, yes
> of course that could work.
>
> > poisoned dirty page == corrupt data => process shall be killed
> > poisoned clean page == recoverable data => process shall survive
> >
> > In the case of a dirty hwpoison page, if we reload the old on-disk data
> > and let the application proceed with it, it may lead to *silent* data
> > corruption/inconsistency, because the application will first see v2
> > then v1, which is illogical and hence may mess up its internal data
> > structures.
>
> Right, but how do you prevent that? There is no way to reconstruct the
> most up-to-date data because it was destroyed.

By killing the application ruthlessly, rather than allowing it to go rotten quietly.
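
To make that policy concrete, roughly the following decision is intended
for a poisoned pagecache page (an illustrative sketch only, not code from
this patch series; the action names are made up for the example):

#include <linux/mm.h>
#include <linux/page-flags.h>

/* made-up action codes, just for this sketch */
enum poison_action { POISON_DROP_PAGE, POISON_KILL_PROCS };

static enum poison_action poisoned_pagecache_policy(struct page *page)
{
	if (PageDirty(page)) {
		/*
		 * The latest (v2) data is gone; silently falling back to
		 * the on-disk v1 copy risks silent corruption, so kill
		 * the processes that may have seen v2.
		 */
		return POISON_KILL_PROCS;
	}

	/*
	 * Clean page: an intact copy exists on disk, so the page can be
	 * dropped from the pagecache and re-read later; processes survive.
	 */
	return POISON_DROP_PAGE;
}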

> > > You will need to demonstrate a *big* advantage before doing crazy things
> > > with writeback ;)
> >
> > OK. We can do two things about poisoned writeback pages:
> >
> > 1) to stop IO for them, thus preventing corrupted data from hitting disk
> > and/or triggering further machine checks
>
> 1b) At which point, you invoke the end-io handlers, and the page is
> no longer under writeback.
>
> > 2) to isolate them from page cache, thus preventing possible
> > references in the writeback time window
>
> And then this is possible because you aren't violating mm
> assumptions due to 1b. This proceeds just as the existing
> pagecache mce error handler case which exists now.

Yeah, that's a good scheme - we are talking about two interception
schemes here. Mine is a passive one and yours is an active one.

passive: check for hwpoison pages at __generic_make_request()/elv_next_request()
         (the code will be enabled by an mce_bad_io_pages counter; a rough
         sketch follows after the list below)

active: iterate all queued requests for hwpoison pages

Each has its merits and complexities.

I'll list the merits (+) and complexities (-) of the passive approach;
on top of these you automatically get the merits of the active one:

+ works in generic code, so we don't have to touch each of the deadline/as/cfq elevators
- the wait_on_page_writeback() puzzle because of the writeback time window

+ could also intercept the "cannot de-dirty for now" pages when they
eventually go to writeback IO
- have to avoid filesystem references on PG_hwpoison pages, e.g.
- zeroing partial EOF page when i_size is not page aligned
- calculating checksums
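
For illustration, the passive check could look roughly like the sketch
below. This is not the posted code: mce_bad_io_pages is the counter
mentioned above, the helper name is made up, and the exact hook point /
failure convention in __generic_make_request() would still need care.

#include <linux/bio.h>
#include <linux/mm.h>

/* counter of poisoned pages that may still have IO queued (sketch) */
extern atomic_t mce_bad_io_pages;

/*
 * Passive interception sketch: scan a bio's pages for PG_hwpoison and
 * refuse to submit it, so corrupted data never hits the disk and the
 * poisoned memory is not touched again by DMA.
 */
static bool bio_has_hwpoison_page(struct bio *bio)
{
	struct bio_vec *bvec;
	int i;

	if (!atomic_read(&mce_bad_io_pages))
		return false;		/* fast path: nothing poisoned */

	bio_for_each_segment(bvec, bio, i)
		if (PageHWPoison(bvec->bv_page))
			return true;

	return false;
}

/*
 * Conceptually, in __generic_make_request():
 *
 *	if (bio_has_hwpoison_page(bio)) {
 *		bio_endio(bio, -EIO);	fail the IO before it is queued
 *		return;
 *	}
 */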


> > > > Now it's obvious that reusing more code than truncate_complete_page()
> > > > is not easy (or natural).
> > >
> > > Just lock the page and wait for writeback, then do the truncate
> > > work in another function. In your case if you've already unmapped
> > > the page then it won't try to unmap again so no problem.
> > >
> > > Truncating from pagecache does not change ->index so you can
> > > move the loop logic out.
> >
> > Right. So effectively the reusable function is exactly
> > truncate_complete_page(). As I said this reuse is not a big gain.
>
> Anyway, we don't have to argue about it. I already sent a patch
> because it was so hard to do, so let's move past this ;)
>
>
> > > > Yes it's kind of insane. I'm interested in reasoning it out though.
>
> Well, with the IO interception (I missed this point), it seems
> maybe no longer so insane. We could see how it looks.

OK.
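
If it helps to visualize the truncate_complete_page() reuse discussed
above, the shape would be roughly the following (a sketch only, following
your lock-and-wait suggestion; truncate_complete_page() is currently
static in mm/truncate.c, so it would have to be made reachable first):

#include <linux/mm.h>
#include <linux/pagemap.h>

/* sketch: drop an already-unmapped, poisoned pagecache page */
static void hwpoison_isolate_page(struct address_space *mapping,
				  struct page *page)
{
	lock_page(page);
	wait_on_page_writeback(page);	/* window closed by the IO interception */
	truncate_complete_page(mapping, page);
	unlock_page(page);
}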

Thanks,
Fengguang
