[patch] aio: invalidate async directio writes

From: Jeff Moyer
Date: Wed Jun 18 2008 - 14:11:04 EST


Hi, Andrew,

This is a follow-up to:

commit bdb76ef5a4bc8676a81034a443f1eda450b4babb
Author: Zach Brown <zach.brown@xxxxxxxxxx>
Date: Tue Oct 30 11:45:46 2007 -0700

dio: fix cache invalidation after sync writes

Commit commit 65b8291c4000e5f38fc94fb2ca0cb7e8683c8a1b ("dio: invalidate
clean pages before dio write") introduced a bug which stopped dio from
ever invalidating the page cache after writes. It still invalidated it
before writes so most users were fine.

Karl Schendel reported ( http://lkml.org/lkml/2007/10/26/481 ) hitting
this bug when he had a buffered reader immediately reading file data
after an O_DIRECT [writer] had written the data. The kernel issued
read-ahead beyond the position of the reader which overlapped with the
O_DIRECT writer. The failure to invalidate after writes caused the
reader to see stale data from the read-ahead.

The following patch is originally from Karl. The following commentary
is his:

The below 3rd try takes on your suggestion of just invalidating
no matter what the retval from the direct_IO call. I ran it
thru the test-case several times and it has worked every time.
The post-invalidate is probably still too early for async-directio,
but I don't have a testcase for that; just sync. And, this
won't be any worse in the async case.

I added a test to the aio-dio-regress repository which mimics Karl's IO
pattern. It verifed the bad behaviour and that the patch fixed it. I
agree with Karl, this still doesn't help the case where a buffered
reader follows an AIO O_DIRECT writer. That will require a bit more
work.

This gives up on the idea of returning EIO to indicate to userspace that
stale data remains if the invalidation failed.

Note the second-to-last paragraph, where it mentions that this does not fix
the AIO case. I updated the regression test to also perform asynchronous
I/O and verified that the problem does exist.

To fix the problem, we need to invalidate the pages that were under write
I/O after the I/O completes. Because the I/O completion handler can be called
in interrupt context (and invalidate_inode_pages2 cannot be called in interrupt
context), this patch opts to defer the completion to a workqueue. That
workqueue is responsible for invalidating the page cache pages and completing
the I/O.

I verified that the test case passes with the following patch applied.
Since this affects the code path for all async writers, I had our
performance group run a patched kernel through an OLTP workload. I'm
told that the noise level for this test is in the 5% range. The results
showed a 2.9% degradation in performance at 80 users, and 80 users was
the peak performance for the test. For other user counts, the patched
kernel varied from a percent better to a couple of percent worse. So,
overall I think the patch is worth taking, given that it fixes a
potential data corrupter. (Sorry I can't report the actual numbers,
since the tests were run on unreleased hardware.)

Comments, as always, are encouraged.

Cheers,

Jeff

diff --git a/fs/direct-io.c b/fs/direct-io.c
index 9e81add..90fa9e2 100644
--- a/fs/direct-io.c
+++ b/fs/direct-io.c
@@ -131,6 +131,8 @@ struct dio {
int is_async; /* is IO async ? */
int io_error; /* IO error in completion path */
ssize_t result; /* IO result */
+ struct list_head done_list; /* AIO DIO writes to be completed
+ * in process context */
};

/*
@@ -260,6 +262,40 @@ static int dio_complete(struct dio *dio, loff_t offset, int ret)
return ret;
}

+static void aio_complete_fn(void *data);
+static DECLARE_WORK(aio_complete_work, aio_complete_fn, NULL);
+static DEFINE_SPINLOCK(iocb_completion_list_lock);
+static LIST_HEAD(iocb_completion_list);
+
+static void aio_complete_fn(void *data)
+{
+ unsigned long flags;
+ LIST_HEAD(private);
+ struct dio *dio, *tmp;
+
+ spin_lock_irqsave(&iocb_completion_list_lock, flags);
+ list_splice_init(&iocb_completion_list, &private);
+ spin_unlock_irqrestore(&iocb_completion_list_lock, flags);
+
+ list_for_each_entry_safe(dio, tmp, &private, done_list) {
+ struct kiocb *iocb = dio->iocb;
+ loff_t offset = iocb->ki_pos;
+ struct file *file = iocb->ki_filp;
+ struct address_space *mapping = file->f_mapping;
+ int ret;
+ pgoff_t end = (offset + dio->size - 1) >> PAGE_CACHE_SHIFT;
+
+ /* and now invalidate the page cache */
+ ret = dio_complete(dio, offset, 0);
+ if (mapping->nrpages)
+ invalidate_inode_pages2_range(mapping,
+ offset >> PAGE_CACHE_SHIFT, end);
+ aio_complete(dio->iocb, ret, 0);
+ list_del(&dio->done_list);
+ kfree(dio);
+ }
+}
+
static int dio_bio_complete(struct dio *dio, struct bio *bio);
/*
* Asynchronous IO callback.
@@ -280,9 +316,22 @@ static void dio_bio_end_aio(struct bio *bio, int error)
spin_unlock_irqrestore(&dio->bio_lock, flags);

if (remaining == 0) {
- int ret = dio_complete(dio, dio->iocb->ki_pos, 0);
- aio_complete(dio->iocb, ret, 0);
- kfree(dio);
+ /* For async O_DIRECT writes, we need to invalidate the
+ * page cache after the write completes. Kick off a
+ * workqueue to do this and issue the completion in process
+ * context.
+ */
+ if (dio->rw == READ) {
+ int ret = dio_complete(dio, dio->iocb->ki_pos, 0);
+ aio_complete(dio->iocb, ret, 0);
+ kfree(dio);
+ } else {
+ unsigned long flags;
+ spin_lock_irqsave(&iocb_completion_list_lock, flags);
+ list_add(&dio->done_list, &iocb_completion_list);
+ spin_unlock_irqrestore(&iocb_completion_list_lock, flags);
+ schedule_work(&aio_complete_work);
+ }
}
}

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/