nfs invalidates lose pte dirty bits

From: Andrea Arcangeli
Date: Thu Dec 22 2005 - 12:55:07 EST


Hello Andrew,

this is yet another problem exposed by the new invalidate_inode_pages2
semantics.

If a mapping has dirty data (pte dirty and page clean), and an
invalidate triggers, the filemap_write_and_wait will do nothing, but the
invalidate_inode_pages2 will destroy the pte dirty bit and discard the
pagecache.

This is an attempted untested fix, however I'm not sure if it's enough
because we'd be still losing some dirty bit if the application keeps
touching the mapping in between the unmap_mapping_range and the
invalidate_inode_pages2. So perhaps this should be combined with a non
destructive version of invalidate_inode_pages2 that will simply skip
over any dirty page (because we have the guarantee that the dirty
information has been generated by userspace after the
filemap_write_and_wait). Invalidates may be spurious, it's not like with
O_DIRECT were we're guaranteed we overwritten the stuff in the atomic
context of the write() syscall.

However for the application being tested I think this should be enough,
since it looked single threaded.

An incremental patch could change the call to invalidate_inode_pages2 to
something else like invalidate_inode_clean_pages so that multithreaded
programs won't risk to lose dirty bits in the ptes too.

Or you may change your mind and we can go back to a much simpler relaxed
semantic of invalidates, where people notices changes in mapping only
after an munmap (like 2.4), that only requires allowing PG_uptodate
pages to be mapped in userspace like 2.4 (you know I tend to prefer
those semantics, even though they're not even close to distributed
shared memory ;).

I'm also wondering why you added this -EIO in invalidate_inode_pages2:

if (!invalidate_complete_page(mapping, page)) {
if (was_dirty)
set_page_dirty(page);
ret = -EIO;
}

When -EIO failures triggers because of a racy invalidate, the db write
will get -EIO as retval. Especially the PageDirty check in
invalidate_complete_page could return 0, but it doesn't make sense to
check the PageDirty bit inside invalidate_inode_pages2, and we're not
required to do the set_page_dirty either, the whole point of
invalidate_inode_pages2 is to destroy dirty bits. We should go ahead and
never returned -EIO there if the page was found dirty again, right? I
guess it's ok but it doesn't make much sense and I'd rather avoid -EIO
unless they're strictly necessary, because every time a db gets a -EIO
we get a critical bugreport.

Only a potential invalidate_inode_clean_pages for nfs invalidates, would
have to preserve the dirty bits and never call test_clear_page_dirty in
the first place.

Thanks!

Signed-off-by: Andrea Arcangeli <andrea@xxxxxxx>

diff -r 994d5b26cde2 fs/nfs/inode.c
--- a/fs/nfs/inode.c Mon Dec 19 23:00:04 2005 +0000
+++ b/fs/nfs/inode.c Thu Dec 22 18:37:25 2005 +0100
@@ -1180,8 +1180,9 @@

if (nfsi->cache_validity & NFS_INO_INVALID_DATA) {
if (S_ISREG(inode->i_mode)) {
- if (filemap_fdatawrite(mapping) == 0)
- filemap_fdatawait(mapping);
+ if (mapping_mapped(inode->i_mapping))
+ unmap_mapping_range(inode->i_mapping, 0, -1, 0);
+ filemap_write_and_wait(inode->i_mapping);
nfs_wb_all(inode);
}
invalidate_inode_pages2(mapping);
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/