[RFC v4] ext4: Coordinate fsync requests

From: Darrick J. Wong
Date: Wed Aug 18 2010 - 22:14:55 EST


This patch attempts to coordinate barrier requests being sent in by fsync.
Instead of each fsync call initiating its own barrier, there's now a flag to
indicate if (0) no barriers are ongoing, (1) we're delaying a short time to
collect other fsync threads, or (2) we're actually in-progress on a barrier.

So, if someone calls ext4_sync_file and no barriers are in progress, the flag
shifts from 0->1 and the thread delays for a short time to see if there are
any other threads that are close behind in ext4_sync_file. After that wait,
the state transitions to 2 and the barrier is issued. Once that's done, the
state goes back to 0 and a completion is signalled.

Those close-behind threads see the flag is already 1, and go to sleep until the
completion is signalled. Instead of issuing a barrier themselves, they simply
wait for that first thread to do it for them.

In an earlier version of this patch, the "short time" was 500ms, adjustable by
a module parameter. Now, it's a mount option, and the mount option has three
values: x = 0, which gets you the old behavior; x = -1, which uses the average
commit time for the delay period; and x > 0, which sets the delay time to
min(x, average commit time).

So I gave this patchset some wider testing in my lab, and came up with the
following spreadsheet:
https://spreadsheets.google.com/ccc?key=0AtDnBlSkyNjodDZ4OEdSZC01X2haVi1ncmpGOXpfNkE&hl=en&authkey=CMznqcgH

These results do not reflect Tejun/Christoph's latest barrier/flush semantic
changes because ... I haven't figured out the correct sequence of 2.6.35+,
tejun's patchset, and hch's patchset. With Tejun's original set of patches,
the performance actually decreased a little bit, so I haven't bothered to
upload those results while I try to put together a new tree.

The comma-separated numbers in the leftmost column indicates which options were
in effect during the run. From left to right, they are (1) whether or not
directio is enabled; (2) the max_fsync_delay parameter (-1 is automatic
tuning, 0 is the behavior prior to my patch, and 500 is the v3 patch);
(3) whether or not Jan Kara's barrier generation counter is disabled; and (4)
whether or not my old dirty flag patch is disabled. The old system behaviors
are in the *,0,1,1 rows, and the everything-on behavior should be in the
*,-1,0,0 rows.

>From these results, it seems like a reasonable conclusion that enabling Jan
Kara's barrier generation counter patch (*,*,0,*) does increase the tps count.
It also would appear that enabling the barrier coordination patch with the
automatic barrier delay tuning (*,-1,*,*) also seems to be increasing tps while
reducing barrier counts. It's not so clear that my old dirty flag patch
(*,*,*,0) is doing much anymore.

The *,na,na,na rows reflect what happens if I turn barriers off entirely, so I
can compare the new numbers with the nobarrier behavior.

elm3c44_sas is a 36-spindle SAS storage array (RAID0).
elm3c44_ssd is 4 Intel SSDs tied together in a RAID0 via md.
elm3c65_sata is a single SATA disk.
elm3c71_sas is a 8-spindle SAS storage array (RAID0).
elm3c71_scsi is a 14-spindle SCSI storage array (RAID0).
elm3c75_ide is a single IDE laptop disk.

The storage array have battery-backed WC.

Signed-off-by: Darrick J. Wong <djwong@xxxxxxxxxx>
---

fs/ext4/ext4.h | 7 +++++++
fs/ext4/fsync.c | 55 +++++++++++++++++++++++++++++++++++++++++++++++++++----
fs/ext4/super.c | 21 ++++++++++++++++++++-
3 files changed, 78 insertions(+), 5 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 19a4de5..9e8dcc7 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -1143,6 +1143,13 @@ struct ext4_sb_info {

/* workqueue for dio unwritten */
struct workqueue_struct *dio_unwritten_wq;
+
+ /* fsync coordination */
+ spinlock_t barrier_flag_lock;
+ int in_barrier;
+ struct completion barrier_completion;
+ u64 avg_barrier_time;
+ int max_fsync_delay;
};

static inline struct ext4_sb_info *EXT4_SB(struct super_block *sb)
diff --git a/fs/ext4/fsync.c b/fs/ext4/fsync.c
index 592adf2..e164ccd 100644
--- a/fs/ext4/fsync.c
+++ b/fs/ext4/fsync.c
@@ -75,9 +75,12 @@ int ext4_sync_file(struct file *file, int datasync)
{
struct inode *inode = file->f_mapping->host;
struct ext4_inode_info *ei = EXT4_I(inode);
- journal_t *journal = EXT4_SB(inode->i_sb)->s_journal;
+ struct ext4_sb_info *sb = EXT4_SB(inode->i_sb);
+ journal_t *journal = sb->s_journal;
int ret;
tid_t commit_tid;
+ ktime_t before, expires;
+ u64 duration;

J_ASSERT(ext4_journal_current_handle() == NULL);

@@ -130,8 +133,52 @@ int ext4_sync_file(struct file *file, int datasync)
blkdev_issue_flush(inode->i_sb->s_bdev, GFP_KERNEL,
NULL, BLKDEV_IFL_WAIT);
ret = jbd2_log_wait_commit(journal, commit_tid);
- } else if (journal->j_flags & JBD2_BARRIER)
- blkdev_issue_flush(inode->i_sb->s_bdev, GFP_KERNEL, NULL,
- BLKDEV_IFL_WAIT);
+ } else if (journal->j_flags & JBD2_BARRIER) {
+ if (!sb->max_fsync_delay) {
+ blkdev_issue_flush(inode->i_sb->s_bdev, GFP_KERNEL, NULL,
+ BLKDEV_IFL_WAIT);
+ goto fast_out;
+ }
+again:
+ spin_lock(&sb->barrier_flag_lock);
+ if (sb->in_barrier == 2) {
+ spin_unlock(&sb->barrier_flag_lock);
+ ret = wait_for_completion_interruptible(&sb->barrier_completion);
+ goto again;
+ } else if (sb->in_barrier) {
+ spin_unlock(&sb->barrier_flag_lock);
+ ret = wait_for_completion_interruptible(&sb->barrier_completion);
+ } else {
+ sb->in_barrier = 1;
+ INIT_COMPLETION(sb->barrier_completion);
+ spin_unlock(&sb->barrier_flag_lock);
+
+ if (sb->max_fsync_delay > 0 && (sb->max_fsync_delay * 1000) < sb->avg_barrier_time)
+ expires = ktime_add_ns(ktime_get(), sb->max_fsync_delay * 1000);
+ else
+ expires = ktime_add_ns(ktime_get(), sb->avg_barrier_time);
+ set_current_state(TASK_UNINTERRUPTIBLE);
+ schedule_hrtimeout(&expires, HRTIMER_MODE_ABS);
+
+ spin_lock(&sb->barrier_flag_lock);
+ sb->in_barrier = 2;
+ spin_unlock(&sb->barrier_flag_lock);
+
+ before = ktime_get();
+ blkdev_issue_flush(inode->i_sb->s_bdev, GFP_KERNEL, NULL, BLKDEV_IFL_WAIT);
+ duration = ktime_to_ns(ktime_sub(ktime_get(), before));
+ if (likely(sb->avg_barrier_time))
+ sb->avg_barrier_time = (duration + sb->avg_barrier_time * 15) / 16;
+ else
+ sb->avg_barrier_time = duration;
+
+ complete_all(&sb->barrier_completion);
+
+ spin_lock(&sb->barrier_flag_lock);
+ sb->in_barrier = 0;
+ spin_unlock(&sb->barrier_flag_lock);
+ }
+ }
+fast_out:
return ret;
}
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 4e8983a..8d04e43 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -953,6 +953,7 @@ static int ext4_show_options(struct seq_file *seq, struct vfsmount *vfs)
if (!test_opt(sb, DELALLOC))
seq_puts(seq, ",nodelalloc");

+ seq_printf(seq, ",max_fsync_delay=%d", sbi->max_fsync_delay);

if (sbi->s_stripe)
seq_printf(seq, ",stripe=%lu", sbi->s_stripe);
@@ -1160,7 +1161,7 @@ enum {
Opt_block_validity, Opt_noblock_validity,
Opt_inode_readahead_blks, Opt_journal_ioprio,
Opt_dioread_nolock, Opt_dioread_lock,
- Opt_discard, Opt_nodiscard,
+ Opt_discard, Opt_nodiscard, Opt_max_fsync_delay,
};

static const match_table_t tokens = {
@@ -1231,6 +1232,7 @@ static const match_table_t tokens = {
{Opt_dioread_lock, "dioread_lock"},
{Opt_discard, "discard"},
{Opt_nodiscard, "nodiscard"},
+ {Opt_max_fsync_delay, "max_fsync_delay=%d"},
{Opt_err, NULL},
};

@@ -1699,6 +1701,16 @@ set_qf_format:
case Opt_dioread_lock:
clear_opt(sbi->s_mount_opt, DIOREAD_NOLOCK);
break;
+ case Opt_max_fsync_delay:
+ if (args[0].from) {
+ if (match_int(&args[0], &option))
+ return 0;
+ } else
+ return 0;
+
+ sbi->max_fsync_delay = option;
+ ext4_msg(sb, KERN_INFO, "Maximum fsync delay %d\n", option);
+ break;
default:
ext4_msg(sb, KERN_ERR,
"Unrecognized mount option \"%s\" "
@@ -2562,6 +2574,8 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent)
if (!IS_EXT3_SB(sb))
set_opt(sbi->s_mount_opt, DELALLOC);

+ EXT4_SB(sb)->max_fsync_delay = -1;
+
if (!parse_options((char *) data, sb, &journal_devnum,
&journal_ioprio, NULL, 0))
goto failed_mount;
@@ -3045,6 +3059,11 @@ no_journal:
ext4_msg(sb, KERN_INFO, "mounted filesystem with%s. "
"Opts: %s", descr, orig_data);

+ EXT4_SB(sb)->in_barrier = 0;
+ EXT4_SB(sb)->avg_barrier_time = 0;
+ spin_lock_init(&EXT4_SB(sb)->barrier_flag_lock);
+ init_completion(&EXT4_SB(sb)->barrier_completion);
+
lock_kernel();
kfree(orig_data);
return 0;
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/