Re: [PATCH] writeback: Fix broken sync writeback

From: Linus Torvalds
Date: Fri Feb 12 2010 - 10:46:17 EST




On Fri, 12 Feb 2010, Jens Axboe wrote:
>
> This fixes it by using the passed in page writeback count, instead of
> doing MAX_WRITEBACK_PAGES batches, which gets us much better performance
> (Jan reports it's up from ~400KB/sec to 10MB/sec) and makes sync(1)
> finish properly even when new pages are being dirted.

This seems broken.

The whole point of MAX_WRITEBACK_PAGES was to make sure that we don't
generate a single IO bigger than a couple of MB. And then we have that
loop around things to do multiple chunks. Your change to use nr_pages
seems to make the whole looping totally pointless, and breaks that "don't
do huge hunks" logic.

So I don't think that your patch is correct.

That said, I _do_ believe you when you say it makes a difference, which
makes me think there is a bug there. I just don't think you fixed the
right bug, and your change just happens to do what you wanted by pure
random luck.

The _real_ bug seems to bethe one you mentioned, but then ignored:

> Instead of flushing everything older than the sync run, it will do
> chunks of normal MAX_WRITEBACK_PAGES writeback and restart over and
> over.

and it would seem that the _logical_ way to fix it would be something like
the appended...

Hmm? Even when you do a 'fdatasync()', it's not going to guarantee that
any _future_ data is written back, so the 'oldest_jif' thing would seem to
be sane regardless of sync mode.

NOTE NOTE NOTE! I do have to admit that this patch scares me, because
there could be some bug in the 'older_than_this' logic that means that
somebody sets it even if the inode is already dirty. So this patch makes
conceptual sense to me, and I think it's the right thing to do, but I
also suspect that we do not actually have a lot of test coverage of the
whole 'older_than_this' logic, because it historically has been just a
small optimization for kupdated.

So this patch scares me, as it could break 'fdatasync' entirely. So
somebody should really double-check the whole 'dirtied_when' logic, just
to be safe. If anybody ever sets 'dirtied_when' to the current time even
if the inode is already dirty (and has an earlier dirtied_when'), then
that would open up 'fdatasync()' and friends up to not writing things
back properly at all (because a newer write set 'dirtied_when' so that
old writes get ignored and thought to be 'totally new')

Comments?

Linus

---
fs/fs-writeback.c | 15 ++++++++++-----
1 files changed, 10 insertions(+), 5 deletions(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 1a7c42c..a0a8424 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -738,11 +738,16 @@ static long wb_writeback(struct bdi_writeback *wb,
long wrote = 0;
struct inode *inode;

- if (wbc.for_kupdate) {
- wbc.older_than_this = &oldest_jif;
- oldest_jif = jiffies -
- msecs_to_jiffies(dirty_expire_interval * 10);
- }
+ /*
+ * We never write back data that was dirtied _after_ we
+ * started writeback. But kupdate doesn't even want to
+ * write back recently dirtied stuff, only older data.
+ */
+ oldest_jif = jiffies-1;
+ wbc.older_than_this = &oldest_jif;
+ if (wbc.for_kupdate)
+ oldest_jif -= msecs_to_jiffies(dirty_expire_interval * 10);
+
if (!wbc.range_cyclic) {
wbc.range_start = 0;
wbc.range_end = LLONG_MAX;
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/