Re: [PATCH 2.6.2-rc3-mm1] DIO read race fix

From: Janet Morgan
Date: Wed Feb 04 2004 - 21:59:19 EST


Andrew Morton wrote:

Daniel McNeil <daniel@xxxxxxxx> wrote:


I have found (finally) the problem causing DIO reads racing with
buffered writes to see uninitialized data on ext3 file systems (which is what I have been testing on).



What kernel? If -mm, is this the only remaining buffered-vs-direct
problem?



I think there was consensus on two other patches along the way:

http://marc.theaimsgroup.com/?l=linux-kernel&m=107286971311559&w=2
http://marc.theaimsgroup.com/?l=linux-aio&m=107291089712224&w=2

-Janet

The problem is caused by the changes to __block_write_page_full()
and a race with journaling:

journal_commit_transaction() -> ll_rw_block() -> submit_bh()

ll_rw_block() locks the buffer, clears buffer dirty and calls
submit_bh()

A racing __block_write_full_page() (from ext3_ordered_writepage())

would see that buffer_dirty() is not set because the i/o
is still in flight, so it would not do a bh_submit()

It would SetPageWriteback() and unlock_page() and then
see that no i/o was submitted and call end_page_writeback()
(with the i/o still in flight).

This would allow the DIO code to issue the DIO read while buffer writes
are still in flight. The i/o can be reordered by i/o scheduling and
the DIO can complete BEFORE the writebacks complete. Thus the DIO
sees the old uninitialized data.



ow. How'd you work this out?



Here is a quick hack that fixes it, but I am not sure if this the
proper long term fix.



The problem is that ext3 and the VFS are using different paradigms. VFS is
all page-based, but ext3 is all block-based. One or the other needs to do
something nasty.

One approach would be to change the JBD write_out_data_locked loop to use
block_write_full_page(bh->b_page) instead of buffer_head operations. But
that could get hairy with blocksize < PAGE_SIZE.

Thanks for working this out. Let me ponder it for a bit.





-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/