Re: [PATCH] Possible data loss on ext[34],reiserfs with external journal

From: Oleg Drokin
Date: Tue Dec 15 2009 - 01:33:23 EST

Next message: Arve Hjønnevåg: "Re: tree with htc dream support"
Previous message: Simon Horman: "Re: [PATCH] ipvs: Add boundary check on ioctl arguments"
In reply to: tytso: "Re: [PATCH] Possible data loss on ext[34], reiserfs with externaljournal"
Next in thread: tytso: "Re: [PATCH] Possible data loss on ext[34], reiserfs with externaljournal"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

Hello!

On Dec 15, 2009, at 12:32 AM, tytso@xxxxxxx wrote:
> Can you separate this patch into separate ones for each file system?

Sure.

> I think we can actually do better for ext4; for example, in the case
> of data=journal or data=writeback, it's not necessary to flush the fs
> data device. It's only the case of data=ordered that we need to send
> a barrier. With that optimization, we do need to add a barrier in the
> case of fsync() and when we are doing a journal checkpoint, but the
> extra code complexity is worth not having to force a barrier for the
> fs data device for every single commit, especially in the data=journal
> mode with a heavy fsync workload.

Indeed, this is a good idea too.
I think we still can squeeze a bit more juice out of it if we can
submit the data device flush right after submitting all the data, but
only do the waiting for it before issuing journal device flush in normal
journaling mode (we need to do the waiting before writing commit block
in async journaling mode).

> Do you have a test case that will allow you to easily try out this
> patch, in all of ext4's various journalling modes? Thanks!!

Not for vanilla kernel, but I'll see if I can construct something easily
for it.
> @@ -277,6 +278,16 @@ static int journal_finish_inode_data_buffers(journal_t *journal,
> struct jbd2_inode *jinode, *next_i;
> int err, ret = 0;
>
> + /*
> + * If the journal is not located on the file system device,
> + * then we must flush the file system device before we issue
> + * the commit record
> + */
> + if (commit_transaction->t_flushed_data_blocks &&
> + (journal->j_fs_dev != journal->j_dev) &&
> + (journal->j_flags & JBD2_BARRIER))
> + blkdev_issue_flush(journal->j_fs_dev, NULL);
> +

I am afraid this is not enough. This code is called after journal was flushed for
async commit case, so it leaves a race window where journal transaction is already
on disk and complete, but the data is still in cache somewhere.

Also the callsite has this comment which is misleading, I think:
/*
* This is the right place to wait for data buffers both for ASYNC
* and !ASYNC commit. If commit is ASYNC, we need to wait only after
* the commit block went to disk (which happens above). If commit is
* SYNC, we need to wait for data buffers before we start writing
* commit block, which happens below in such setting.
*/

In fact it is only safe to wait here in ASYNC mode (and internal journal only) because
the device was already flushed (and I hope all device drivers drain the queue too, if not,
even this might be problematic) by the blkdev_issue_flush.

Bye,
Oleg
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

Next message: Arve Hjønnevåg: "Re: tree with htc dream support"
Previous message: Simon Horman: "Re: [PATCH] ipvs: Add boundary check on ioctl arguments"
In reply to: tytso: "Re: [PATCH] Possible data loss on ext[34], reiserfs with externaljournal"
Next in thread: tytso: "Re: [PATCH] Possible data loss on ext[34], reiserfs with externaljournal"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]