Re: [PATCH v4] ceph: invalidate pages when doing direct/sync writes

From: Jeff Layton
Date: Thu Apr 07 2022 - 16:34:21 EST


On Fri, 2022-04-08 at 03:24 +0800, Xiubo Li wrote:
> On 4/8/22 3:16 AM, Jeff Layton wrote:
> > On Fri, 2022-04-08 at 03:03 +0800, Xiubo Li wrote:
> > > On 4/7/22 11:15 PM, Luís Henriques wrote:
> > > > When doing a direct/sync write, we need to invalidate the page cache in
> > > > the range being written to. If we don't do this, the cache will include
> > > > invalid data as we just did a write that avoided the page cache.
> > > >
> > > > Signed-off-by: Luís Henriques <lhenriques@xxxxxxx>
> > > > ---
> > > > fs/ceph/file.c | 19 ++++++++++++++-----
> > > > 1 file changed, 14 insertions(+), 5 deletions(-)
> > > >
> > > > Changes since v3:
> > > > - Dropped initial call to invalidate_inode_pages2_range()
> > > > - Added extra comment to document invalidation
> > > >
> > > > Changes since v2:
> > > > - Invalidation needs to be done after a write
> > > >
> > > > Changes since v1:
> > > > - Replaced truncate_inode_pages_range() by invalidate_inode_pages2_range
> > > > - Call fscache_invalidate with FSCACHE_INVAL_DIO_WRITE if we're doing DIO
> > > >
> > > > diff --git a/fs/ceph/file.c b/fs/ceph/file.c
> > > > index 5072570c2203..97f764b2fbdd 100644
> > > > --- a/fs/ceph/file.c
> > > > +++ b/fs/ceph/file.c
> > > > @@ -1606,11 +1606,6 @@ ceph_sync_write(struct kiocb *iocb, struct iov_iter *from, loff_t pos,
> > > > return ret;
> > > >
> > > > ceph_fscache_invalidate(inode, false);
> > > > - ret = invalidate_inode_pages2_range(inode->i_mapping,
> > > > - pos >> PAGE_SHIFT,
> > > > - (pos + count - 1) >> PAGE_SHIFT);
> > > > - if (ret < 0)
> > > > - dout("invalidate_inode_pages2_range returned %d\n", ret);
> > > >
> > > > while ((len = iov_iter_count(from)) > 0) {
> > > > size_t left;
> > > > @@ -1938,6 +1933,20 @@ ceph_sync_write(struct kiocb *iocb, struct iov_iter *from, loff_t pos,
> > > > break;
> > > > }
> > > > ceph_clear_error_write(ci);
> > > > +
> > > > + /*
> > > > + * we need to invalidate the page cache here, otherwise the
> > > > + * cache will include invalid data in direct/sync writes.
> > > > + */
> > > > + ret = invalidate_inode_pages2_range(
> > > IMO we'd better use truncate_inode_pages_range() after write. The above
> > > means it's possibly will write the dirty pagecache back, which will
> > > overwrite and corrupt the disk data just wrote.
> > >
> > I disagree. We call filemap_write_and_wait_range at the start of this,
> > so any data that was dirty when we called write() will be written back
> > before the sync write.
> >
> > If we truncate the range, then we'll potentially lose writes that came
> > in after write was issued but before truncate_inode_pages_range. I think
> > we'd rather let what we just wrote be clobbered in this situation than
> > lose a write altogether.
> >
> > All of this is somewhat academic though. If you're mixing buffered and
> > direct writes like this without some sort of locking, then you're just
> > asking for trouble. The aim here is "sane behavior to the best of our
> > ability", but we can't expect it to always be sane when people do insane
> > things. ;)
>
> Just in the case Luis hit. Before writing the new data the mapping
> happen when reading the src in copy_from_usr(). So once the writing done
> the pagecache is caching the stale contents.
>

Not just in that case.

You could have 2 unrelated processes, one doing DIO writes and one doing
mmap writes. You're likely to end up with a mess unless you're very
careful with what you're doing, but there should be some expectation
that it will work if you serialize things correctly and/or have them
writing to their own areas of the file, etc.

In any case, we'll never get perfect cache coherency, and I figure that
until the write returns, what's in the pagecache ought to be considered
valid.

> > > Though it seems impossible that these pagecaches will be marked dirty,
> > > but this call is misleading ?
> > >
> > Not impossible at all. You can open a file O_DIRECT and then mmap the fd
> > for PROT_WRITE (or just open the file a second time and do it).
> >
> > We definitely recommend against mixing buffered and direct I/O, but
> > nothing really prevents someone from doing it. If the user is properly
> > using file locking, then there's really no reason it shouldn't work.
> >
> > > > + inode->i_mapping,
> > > > + pos >> PAGE_SHIFT,
> > > > + (pos + len - 1) >> PAGE_SHIFT);
> > > > + if (ret < 0) {
> > > > + dout("invalidate_inode_pages2_range returned %d\n",
> > > > + ret);
> > > > + ret = 0;
> > > > + }
> > > > pos += len;
> > > > written += len;
> > > > dout("sync_write written %d\n", written);
> > > >
>

--
Jeff Layton <jlayton@xxxxxxxxxx>