Re: [PATCH] Possible data loss on ext[34], reiserfs with external journal

From: Oleg Drokin
Date: Tue Dec 15 2009 - 15:51:26 EST


Hello!

On Dec 15, 2009, at 11:45 AM, tytso@xxxxxxx wrote:
>>> + /*
>>> + * If the journal is not located on the file system device,
>>> + * then we must flush the file system device before we issue
>>> + * the commit record
>>> + */
>>> + if (commit_transaction->t_flushed_data_blocks &&
>>> + (journal->j_fs_dev != journal->j_dev) &&
>>> + (journal->j_flags & JBD2_BARRIER))
>>> + blkdev_issue_flush(journal->j_fs_dev, NULL);
>>> +
>> I am afraid this is not enough. This code is called after journal
>> was flushed for async commit case, so it leaves a race window where
>> journal transaction is already on disk and complete, but the data is
>> still in cache somewhere.
> No, that's actually fine. In the ASYNC_COMMIT case, the commit won't
> be valid until the checksum is correct, and we won't have written any
> descriptor blocks yet at this point. So there is no race because

What do you mean by descriptor blocks? Actual logged blocks?

> during that window, the commit is written but we won't write any
> descriptor blocks until after the barrier returns.

>From my reading of the code, in jbd2_journal_commit_transaction:
After first "JBD: commit phase 2" message we submit actual data buffers.
Then after second "JBD: commit phase 2" message (in the
"while (commit_transaction->t_buffers) {" loop) we actually iterate
through all metadata changes and submit them to disk (after
start_journal_io: label)
Then journal_submit_commit_record writes commit record.
At this point on we have entire transacton + commit
block in the write queue (or partially already on disk) +
all the data blocks in the write queue.
Next blkdev_issue_flush on journal device is done, cementing the transaction
on the permanent storage, but the actual data is still potentially not
there abd it is not until the call to journal_finish_inode_data_buffers()
where that data is flushed too.

If you refer to the code after "JBD: commit phase 3" that waits on actual
transaction blocks buffers to make sure they made it to the disk,
my reading of blkdev_issue_flush is that it adds the flush at the end of the
queue, i.e. after all of the journalled blocks that are sitting in there,
so they all would be already copleted by the time blkdev_issue_flush returns.

Bye,
Oleg--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/