Re: [PATCH] dio: falling through to buffered I/O when invalidation of a page fails

From: Zach Brown
Date: Tue Dec 11 2007 - 20:00:59 EST


Hisashi Hifumi wrote:
> Hi.
>
> Current dio has some problems:
> 1, In ext3 ordered, dio write can return with EIO because of the race
> between invalidation of
> a page and jbd. jbd pins the bhs while committing journal so
> try_to_release_page fails when jbd
> is committing the transaction.

Yeah. It sure would be fantastic if some ext3 expert could stop this
from happening somehow. But that hasn't happened in.. uh.. Badari, for
how many years has this been on the radar? :)

>
> Past discussion about this issue is as follows.
> http://marc.info/?t=119343431200004&r=1&w=2
> http://marc.info/?t=112656762800002&r=1&w=2
>
> 2, invalidate_inode_pages2_range() sets ret=-EIO when
> invalidate_complete_page2()
> fails, but this ret is cleared if do_launder_page() succeed on a page of
> next index.

Oops. That's too bad. So maybe we should fix it by not stomping on
that return code?

ret2 = do_launder()
if (ret2 == 0)
ret2 = invalidate()
if (ret == 0)
ret = ret2

I'd be surprised if we ever wanted to mask an -EIO when later pages
laundered successfully.

> In this case, dio is carried out even if invalidate_complete_page2()
> fails on some pages.
> This can cause inconsistency between memory and blocks on HDD because
> the page
> cache still exists.

Yeah.

> I solved problems above by introducing invalidate_inode_pages3_range()
> and falling
> through to buffered I/O when invalidation of a page failed.

Well, I like the idea of more intelligently dealing with the known
problem between dio and ext3. I'm not sure that falling back to
buffered is right.

If dio can tell that it only failed because jbd held the buffer, should
we have waited for the transaction to complete before trying to
invalidate again?

If we could do that, we'd avoid performing the IO twice.

> We can distinguish between failure of page invalidation and other errors
> with the return value of invalidate_inode_pages3_range().

I'm not sure duplicating the invalidation loop into a new function is
the right thing. Maybe we'd just tweak inode_pages2 to indicate to the
caller the specific failing circumstances somehow. Maybe.

- z
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/