Re: [GIT PULL] Ext3 latency fixes

From: Linus Torvalds
Date: Thu Apr 09 2009 - 11:52:30 EST




On Wed, 8 Apr 2009, Theodore Ts'o wrote:
>
> One of these patches fixes a performance regression caused by a64c8610,
> which unplugged the write queue after every page write. Now that Jens
> added WRITE_SYNC_PLUG, the patch causes us to use it instead of
> WRITE_SYNC, to avoid the implicit unplugging. These patches also seem
> to further improve ext3 latency, especially during the "sync" command
> in Linus's write-big-file-and-sync workload.

So here's a question and an untested _conceptual_ patch.

The kind of writeback mode I'd personally prefer would be more of a
mixture of the current "data=writeback" and "data=ordered" modes, with
something of the best of both worlds. I'd like the data writeback to get
_started_ when the journal is written to disk, but I'd like it to not
block journal updates.

IOW, it wouldn't be "strictly ordered", but at the same time it wouldn't
be totally unordered either.
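To make the distinction concrete, here is a toy sketch of the three behaviours as described above. Every name in it (data_mode, DATA_BACKGROUND, commit_policy, policy_for) is made up for illustration and is not a JBD identifier; it only encodes "does the commit start data writeback, and does it wait for it", under the assumption that this is the whole difference between the modes.

```c
#include <stdbool.h>

/* Hypothetical names, for illustration only; none of these exist in JBD. */
enum data_mode {
	DATA_WRITEBACK,		/* commit ignores file data entirely */
	DATA_ORDERED,		/* commit submits data and waits for it */
	DATA_BACKGROUND,	/* proposed: commit submits data, does not wait */
};

struct commit_policy {
	bool submit_data;	/* start data writeback at commit time? */
	bool wait_for_data;	/* block the commit until that data is on disk? */
};

/* Map a data mode to what a journal commit would do with dirty file data. */
struct commit_policy policy_for(enum data_mode mode)
{
	struct commit_policy p = { false, false };

	switch (mode) {
	case DATA_ORDERED:
		p.submit_data = true;
		p.wait_for_data = true;
		break;
	case DATA_BACKGROUND:
		p.submit_data = true;	/* started, but never waited on */
		break;
	case DATA_WRITEBACK:
		break;
	}
	return p;
}
```

The proposed mode is the only one where submit_data and wait_for_data differ, which is the "half-way" point being asked about.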

For true sync operations (ie fsync()), the VFS layer then does the proper
"wait for data" part.
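From user space, that explicit wait looks roughly like the sketch below. The helper name write_durably() is invented here for illustration; open(), write(), fsync() and close() are the ordinary POSIX calls. The point is only that an application which really needs its data on disk asks for it with fsync(), independent of what the journal commit does in the background.

```c
#include <fcntl.h>
#include <stddef.h>
#include <unistd.h>

/*
 * Hypothetical helper: write len bytes to path and wait until they have
 * been pushed to stable storage. Returns 0 on success, -1 on any failure.
 */
int write_durably(const char *path, const void *buf, size_t len)
{
	const char *p = buf;
	int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);

	if (fd < 0)
		return -1;

	while (len > 0) {
		/* write() normally only dirties the page cache */
		ssize_t n = write(fd, p, len);

		if (n < 0) {
			close(fd);
			return -1;
		}
		p += n;
		len -= (size_t)n;
	}

	/* The explicit "wait for data" step: blocks until writeback is done */
	if (fsync(fd) < 0) {
		close(fd);
		return -1;
	}
	return close(fd);
}
```

A caller that skips the fsync() gets whatever ordering the filesystem's data mode happens to provide, which is exactly the part that would become looser with a background mode.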

I dunno. I don't actually know the JBD internal constraints, but what I'm
talking about is something like the appended patch. It wouldn't help under
really heavy writeback IO (because even if we don't end up waiting for all
the random data to complete, we'd end up waiting when _submitting_ it),
but it might help under somewhat less extreme loads.

This is totally untested. It might well violate some serious internal jbd
rules and eat your filesystem, for all I know. I'm throwing the patch out
as a "would something _like_ this perhaps make sense as a half-way-point
between 'ordered' and 'writeback'?", nothing more.

Hmm?

Linus
---
fs/jbd/commit.c | 11 ++++++++++-
1 files changed, 10 insertions(+), 1 deletions(-)

diff --git a/fs/jbd/commit.c b/fs/jbd/commit.c
index a8e8513..5bea3ed 100644
--- a/fs/jbd/commit.c
+++ b/fs/jbd/commit.c
@@ -184,6 +184,9 @@ static void journal_do_submit_data(struct buffer_head **wbuf, int bufs,
 	}
 }
 
+/* This would obviously be a real flag, set at mount time */
+#define BACKGROUND_DATA(journal) (1)
+
 /*
  * Submit all the data buffers to disk
  */
@@ -198,6 +201,9 @@ static int journal_submit_data_buffers(journal_t *journal,
 	struct buffer_head **wbuf = journal->j_wbuf;
 	int err = 0;
 
+	if (BACKGROUND_DATA(journal))
+		write_op = WRITE;
+
 	/*
 	 * Whenever we unlock the journal and sleep, things can get added
 	 * onto ->t_sync_datalist, so we have to keep looping back to
@@ -254,7 +260,10 @@ write_out_data:
 		if (locked && test_clear_buffer_dirty(bh)) {
 			BUFFER_TRACE(bh, "needs writeout, adding to array");
 			wbuf[bufs++] = bh;
-			__journal_file_buffer(jh, commit_transaction,
+			if (BACKGROUND_DATA(journal))
+				__journal_unfile_buffer(jh);
+			else
+				__journal_file_buffer(jh, commit_transaction,
 								BJ_Locked);
 			jbd_unlock_bh_state(bh);
 			if (bufs == journal->j_wbufsize) {