Re: [PATCH 2.6.2-rc3-mm1] DIO read race fix

From: Daniel McNeil
Date: Thu Feb 05 2004 - 12:57:33 EST


Andrew,

I am still thinking about your patch. I will run some tests today using
2.6.2-mm1 to see if the problem is fixed. My 8-proc machine ran
overnight with 6 copies of the read_under running without problems with
my original patch. Previously on the 8-proc machine, it would hit
uninitialized data within an hour.

The concern I have is that DIO needs filemap_write_and_wait() to
make sure all previously dirty pages have been written back to
disk before the DIO is issued.

If __block_write_full_page() can possibly clear PageWriteback
with buffer i/o still in flight (even for WB_SYNC_NONE) then
a subsequent filemap_write_and_wait() will miss that page.

For example, I previously tried:

do {
get_bh(bh);
+ if (wbc->sync_mode != WB_SYCN_NONE)
+ wait_on_buffer(bh);
if (buffer_mapped(bh) && buffer_dirty(bh)) {
if (wbc->sync_mode != WB_SYNC_NONE) {
lock_buffer(bh);

and this still saw uninitialized data.

Also, if __block_write_full_page() can redirty a page wouldn't this
allow filemap_write_and_wait() to return with page still dirty that
DIO needs written back?

I'll work on updating the other patches.

Daniel



On Wed, 2004-02-04 at 21:33, Andrew Morton wrote:
> Daniel McNeil <daniel@xxxxxxxx> wrote:
> >
> > I have found (finally) the problem causing DIO reads racing with
> > buffered writes to see uninitialized data on ext3 file systems
> > (which is what I have been testing on).
> >
> > The problem is caused by the changes to __block_write_page_full()
> > and a race with journaling:
> >
> > journal_commit_transaction() -> ll_rw_block() -> submit_bh()
> >
> > ll_rw_block() locks the buffer, clears buffer dirty and calls
> > submit_bh()
> >
> > A racing __block_write_full_page() (from ext3_ordered_writepage())
> >
> > would see that buffer_dirty() is not set because the i/o
> > is still in flight, so it would not do a bh_submit()
> >
> > It would SetPageWriteback() and unlock_page() and then
> > see that no i/o was submitted and call end_page_writeback()
> > (with the i/o still in flight).
> >
> > This would allow the DIO code to issue the DIO read while buffer writes
> > are still in flight. The i/o can be reordered by i/o scheduling and
> > the DIO can complete BEFORE the writebacks complete. Thus the DIO
> > sees the old uninitialized data.
>
> I suppose we should go for a general fix to the problem. I'm not 100%
> happy with it. It's similar to yours, except we only wait if
> wbc->sync_mode says it's a write-for-sync. Also we hold the buffer lock
> across all the tests.
>
>
>
>
>
>
> Fix a race which was identified by Daniel McNeil <daniel@xxxxxxxx>
>
> If a buffer_head is under I/O due to JBD's ordered data writeout (which uses
> ll_rw_block()) then either filemap_fdatawrite() or filemap_fdatawait() need
> to wait on the buffer's existing I/O.
>
> Presently neither will do so, because __block_write_full_page() will not
> actually submit any I/O and will hence not mark the page as being under
> writeback.
>
> The best-performing fix would be to somehow mark the page as being under
> writeback and defer waiting for the ll_rw_block-initiated I/O until
> filemap_fdatawait()-time. But this is hard, because in
> __block_write_full_page() we do not have control of the buffer_head's end_io
> handler. Possibly we could make JBD call into end_buffer_async_write(), but
> that gets nasty.
>
> This patch makes __block_write_full_page() wait for any buffer_head I/O to
> complete before inspecting the buffer_head state. It only does this in the
> case where __block_write_full_page() was called for a "data-integrity" write:
> (wbc->sync_mode != WB_SYNC_NONE).
>
> Probably it doesn't matter, because kjournald is currently submitting (or has
> already submitted) all dirty buffers anyway.
>
>
>
> ---
>
> fs/buffer.c | 29 +++++++++++++++--------------
> 1 files changed, 15 insertions(+), 14 deletions(-)
>
> diff -puN fs/buffer.c~O_DIRECT-ll_rw_block-vs-block_write_full_page-fix fs/buffer.c
> --- 25/fs/buffer.c~O_DIRECT-ll_rw_block-vs-block_write_full_page-fix 2004-02-04 20:38:30.000000000 -0800
> +++ 25-akpm/fs/buffer.c 2004-02-04 20:40:19.000000000 -0800
> @@ -1810,23 +1810,24 @@ static int __block_write_full_page(struc
>
> do {
> get_bh(bh);
> - if (buffer_mapped(bh) && buffer_dirty(bh)) {
> - if (wbc->sync_mode != WB_SYNC_NONE) {
> - lock_buffer(bh);
> - } else {
> - if (test_set_buffer_locked(bh)) {
> + if (!buffer_mapped(bh))
> + continue;
> + if (wbc->sync_mode != WB_SYNC_NONE) {
> + lock_buffer(bh);
> + } else {
> + if (test_set_buffer_locked(bh)) {
> + if (buffer_dirty(bh))
> __set_page_dirty_nobuffers(page);
> - continue;
> - }
> - }
> - if (test_clear_buffer_dirty(bh)) {
> - if (!buffer_uptodate(bh))
> - buffer_error();
> - mark_buffer_async_write(bh);
> - } else {
> - unlock_buffer(bh);
> + continue;
> }
> }
> + if (test_clear_buffer_dirty(bh)) {
> + if (!buffer_uptodate(bh))
> + buffer_error();
> + mark_buffer_async_write(bh);
> + } else {
> + unlock_buffer(bh);
> + }
> } while ((bh = bh->b_this_page) != head);
>
> BUG_ON(PageWriteback(page));
>
> _

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/