Re: [PATCH v4] ceph: invalidate pages when doing direct/sync writes

From: Xiubo Li
Date: Sat Apr 09 2022 - 21:36:07 EST



On 4/8/22 8:04 PM, Jeff Layton wrote:
On Fri, 2022-04-08 at 10:47 +0800, Xiubo Li wrote:
On 4/7/22 11:15 PM, Luís Henriques wrote:
When doing a direct/sync write, we need to invalidate the page cache in
the range being written to. If we don't do this, the cache will include
invalid data as we just did a write that avoided the page cache.

Signed-off-by: Luís Henriques <lhenriques@xxxxxxx>
---
fs/ceph/file.c | 19 ++++++++++++++-----
1 file changed, 14 insertions(+), 5 deletions(-)

Changes since v3:
- Dropped initial call to invalidate_inode_pages2_range()
- Added extra comment to document invalidation

Changes since v2:
- Invalidation needs to be done after a write

Changes since v1:
- Replaced truncate_inode_pages_range() with invalidate_inode_pages2_range()
- Call fscache_invalidate with FSCACHE_INVAL_DIO_WRITE if we're doing DIO

diff --git a/fs/ceph/file.c b/fs/ceph/file.c
index 5072570c2203..97f764b2fbdd 100644
--- a/fs/ceph/file.c
+++ b/fs/ceph/file.c
@@ -1606,11 +1606,6 @@ ceph_sync_write(struct kiocb *iocb, struct iov_iter *from, loff_t pos,
                 return ret;
         ceph_fscache_invalidate(inode, false);
-        ret = invalidate_inode_pages2_range(inode->i_mapping,
-                                            pos >> PAGE_SHIFT,
-                                            (pos + count - 1) >> PAGE_SHIFT);
-        if (ret < 0)
-                dout("invalidate_inode_pages2_range returned %d\n", ret);
         while ((len = iov_iter_count(from)) > 0) {
                 size_t left;
@@ -1938,6 +1933,20 @@ ceph_sync_write(struct kiocb *iocb, struct iov_iter *from, loff_t pos,
                         break;
                 }
                 ceph_clear_error_write(ci);
+
+                /*
+                 * we need to invalidate the page cache here, otherwise the
+                 * cache will include invalid data in direct/sync writes.
+                 */
+                ret = invalidate_inode_pages2_range(
+                                inode->i_mapping,
+                                pos >> PAGE_SHIFT,
+                                (pos + len - 1) >> PAGE_SHIFT);
+                if (ret < 0) {
+                        dout("invalidate_inode_pages2_range returned %d\n",
+                             ret);
+                        ret = 0;
For this, IMO it's not safe. If we just ignore the error, the pagecache will
still have invalid data.

That data is not invalid. It's dirty data from a buffered write that
raced with the DIO/sync write we're handling here, i.e. another write
came in while we were already processing the DIO write, but after the
point where we called filemap_write_and_wait.
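
To make that ordering concrete, here is a rough sketch of ceph_sync_write()
with this patch applied. It is only a sketch: the OSD request handling, snap
context and most error paths are omitted, and sync_write_sketch() is an
illustrative name, not a real function in fs/ceph/file.c.

#include <linux/fs.h>
#include <linux/mm.h>
#include <linux/pagemap.h>
#include <linux/uio.h>

static ssize_t sync_write_sketch(struct kiocb *iocb, struct iov_iter *from,
                                 loff_t pos)
{
        struct inode *inode = file_inode(iocb->ki_filp);
        size_t count = iov_iter_count(from);
        int ret;

        /* 1. Flush any dirty pages that already cover the range. */
        ret = filemap_write_and_wait_range(inode->i_mapping,
                                           pos, pos + count - 1);
        if (ret < 0)
                return ret;

        /*
         * 2. Send the data straight to the OSDs, bypassing the page cache.
         *    A concurrent buffered write() may dirty pages in this range
         *    here, i.e. after the flush in step 1.
         */

        /*
         * 3. Invalidate the written range so later reads don't see stale
         *    cached data.  If a racing buffered write from step 2 left
         *    dirty pages behind, this returns an error (e.g. -EBUSY),
         *    which the patch logs and then deliberately ignores.
         */
        ret = invalidate_inode_pages2_range(inode->i_mapping,
                                            pos >> PAGE_SHIFT,
                                            (pos + count - 1) >> PAGE_SHIFT);
        if (ret < 0)
                ret = 0;        /* don't fail the write over this */

        return count;
}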

When two write() calls to the same data are racing like that, the
outcome is undefined. We can't be certain which one will prevail as the
kernel could handle them in either order.

Okay, I think you are right.

-- Xiubo


The good news with Ceph/RADOS is that you shouldn't end up with a torn
write, unless the write happens to span multiple objects. Not much we
can do about that though.

I think what 'ceph_direct_read_write()' does is more correct: it first
writes back and drops the dirty pages in the range with
'invalidate_inode_pages2_range()', without blocking, and later does a
blocking invalidation with 'truncate_inode_pages_range()' for any pages
that 'invalidate_inode_pages2_range()' could not unmap (the -EBUSY case).
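
As a rough illustration of that two-step scheme (loosely modelled on
ceph_direct_read_write(); dio_invalidate_range() is a hypothetical helper
for this discussion, not an existing function):

#include <linux/fs.h>
#include <linux/mm.h>

static void dio_invalidate_range(struct address_space *mapping,
                                 loff_t pos, size_t len)
{
        int ret;

        /*
         * First pass: try to invalidate the range.  A negative return
         * (e.g. -EBUSY) means some pages could not be dropped.
         */
        ret = invalidate_inode_pages2_range(mapping,
                                            pos >> PAGE_SHIFT,
                                            (pos + len - 1) >> PAGE_SHIFT);
        if (ret < 0)
                /*
                 * Second pass: the blocking helper waits on the pages and
                 * removes them from the page cache unconditionally (dirty
                 * data in the range is discarded).
                 */
                truncate_inode_pages_range(mapping, pos,
                                           PAGE_ALIGN(pos + len) - 1);
}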

I'm not convinced this is any better, and it's attempting to impose a
deterministic outcome on a situation that is non-deterministic by
nature.

This way we can always be sure the pagecache has no invalid data after
the write finishes. I think the reason it uses the truncate helper there
is that it's safe, since there shouldn't be any buffered write happening
during DIO?

But from my understanding 'ceph_direct_read_write()' is still buggy.
What if a page fault happens just after 'truncate_inode_pages_range()'?
Can that happen? Should we leave this to user space and rely on file
locking to guarantee it?

Thoughts?
Again, we can't really predict the outcome of two racing writes to the
same area, so I don't see that there is a problem.