Re: [PATCH 2/3] fadvise: Add _VOLATILE, _ISVOLATILE, and _NONVOLATILE flags

From: John Stultz
Date: Mon Apr 30 2012 - 15:41:27 EST


On 04/27/2012 07:04 PM, Dave Chinner wrote:
On Fri, Apr 27, 2012 at 12:14:18PM -0700, John Stultz wrote:
On 04/26/2012 05:39 PM, Dave Chinner wrote:
On Tue, Apr 24, 2012 at 10:49:46AM -0700, John Stultz wrote:
@@ -128,6 +129,19 @@ SYSCALL_DEFINE(fadvise64_64)(int fd, loff_t offset, loff_t len, int advice)
invalidate_mapping_pages(mapping, start_index,
end_index);
break;
+ case POSIX_FADV_VOLATILE:
+ /* First and last PARTIAL page! */
+ start_index = offset >> PAGE_CACHE_SHIFT;
+ end_index = endbyte >> PAGE_CACHE_SHIFT;
+ ret = mapping_range_volatile(mapping, start_index, end_index);
+ break;
+ case POSIX_FADV_NONVOLATILE:
+ /* First and last PARTIAL page! */
+ start_index = offset >> PAGE_CACHE_SHIFT;
+ end_index = endbyte >> PAGE_CACHE_SHIFT;
+ ret = mapping_range_nonvolatile(mapping, start_index,
+ end_index);
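The "First and last PARTIAL page!" comments in the hunk above flag that a plain right shift rounds both ends outward, so a page that is only partly inside the byte range still gets marked. A minimal user-space sketch of the two rounding choices follows. It assumes 4 KiB pages and an inclusive endbyte; the names SKETCH_PAGE_SHIFT, pages_touched() and pages_covered() are made up for illustration and are not part of the patch.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: 4 KiB pages. The kernel's PAGE_CACHE_SHIFT depends on the arch. */
#define SKETCH_PAGE_SHIFT 12
#define SKETCH_PAGE_SIZE  ((uint64_t)1 << SKETCH_PAGE_SHIFT)

struct page_range {
	uint64_t start;	/* first page index, inclusive */
	uint64_t end;	/* last page index, inclusive */
	int empty;	/* nonzero if no page qualifies */
};

/* Outward rounding, as in the hunk: every page touched by [offset, endbyte]. */
static struct page_range pages_touched(uint64_t offset, uint64_t endbyte)
{
	struct page_range r;

	assert(offset <= endbyte);
	r.start = offset >> SKETCH_PAGE_SHIFT;
	r.end = endbyte >> SKETCH_PAGE_SHIFT;
	r.empty = 0;
	return r;
}

/* Inward rounding: only pages that lie entirely inside [offset, endbyte]. */
static struct page_range pages_covered(uint64_t offset, uint64_t endbyte)
{
	struct page_range r;
	uint64_t first, end_excl;

	assert(offset <= endbyte);
	/* Round the start up to the next page boundary. */
	first = (offset + SKETCH_PAGE_SIZE - 1) >> SKETCH_PAGE_SHIFT;
	/* Index one past the last page that ends at or before endbyte. */
	end_excl = (endbyte + 1) >> SKETCH_PAGE_SHIFT;

	r.empty = first >= end_excl;
	r.start = first;
	r.end = r.empty ? first : end_excl - 1;
	return r;
}
```

With offset = 100 and an 8192-byte length, outward rounding selects pages 0 through 2 while inward rounding selects only page 1. Which one is right for volatile ranges depends on whether purging bytes outside the requested range is acceptable.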
As it is, I'm still not sold on these being an fadvise() interface
because all it really is is a delayed hole punching interface whose
functionality is currently specific to tmpfs. The behaviour cannot
be implemented sanely by anything else at this point.
Yea. So I spent some time looking at the various hole punching
mechanisms and they aren't altogether consistent across
filesystems. For instance, on some filesystems (ext4 and mostly
disk-backed filesystems) you have to use fallocate(fd,
FALLOC_FL_PUNCH_HOLE, ...), while on tmpfs, it's
madvise(..., MADV_REMOVE). So in a way, currently, the
FADVISE_VOLATILE is closer to a delayed MADV_REMOVE.
The MADV_REMOVE functionality for hole punching works *only* for
tmpfs - no other filesystem implements the .truncate_range() method.
In fact, several filesystems *can't* implement .truncate_range()
because there is no callout from the page cache truncation code to
allow filesystems to punch out the underlying blocks. The
vmtruncate() code is deprecated for this reason (and various others
like a lack of error handling), and .truncate_range() is just as
nasty. .truncate_range() needs to die, IMO.

So, rather than building more infrastructure on a nasty, filesystem
specific mmap() hack, implement .fallocate() on tmpfs and use the
same interface that every other filesystem uses for punching holes.
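For reference, the fallocate() route from user space looks roughly like the sketch below. The helper names punch_hole(), range_is_zero() and punch_demo() are made up for illustration, the temporary file under /tmp is an assumption, and whether the call succeeds depends on the filesystem backing that file.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <linux/falloc.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Punch a hole without changing the file size. Returns 0, or -1 with errno set. */
static int punch_hole(int fd, off_t offset, off_t len)
{
	return fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
			 offset, len);
}

/* Returns 1 if the whole range reads back as zeros, 0 if not, -1 on read error. */
static int range_is_zero(int fd, off_t offset, size_t len)
{
	char buf[4096];

	while (len > 0) {
		size_t want = len < sizeof(buf) ? len : sizeof(buf);
		ssize_t got = pread(fd, buf, want, offset);

		if (got <= 0)
			return -1;
		for (ssize_t i = 0; i < got; i++)
			if (buf[i] != 0)
				return 0;
		offset += got;
		len -= (size_t)got;
	}
	return 1;
}

/*
 * Write two 4 KiB blocks, punch out the first, and check the result.
 * Returns 0 if the hole reads back as zeros and the second block is intact,
 * 1 if the filesystem refused the punch, -1 on any other failure.
 */
static int punch_demo(void)
{
	char path[] = "/tmp/punch-demo-XXXXXX";
	char data[8192];
	int ret = -1;
	int fd = mkstemp(path);

	if (fd < 0)
		return -1;
	unlink(path);	/* the open fd keeps the file alive until close() */
	memset(data, 'x', sizeof(data));
	if (write(fd, data, sizeof(data)) != (ssize_t)sizeof(data))
		goto out;
	if (punch_hole(fd, 0, 4096) != 0) {
		ret = 1;	/* e.g. EOPNOTSUPP on this filesystem */
		goto out;
	}
	if (range_is_zero(fd, 0, 4096) == 1 &&
	    range_is_zero(fd, 4096, 4096) == 0)
		ret = 0;
out:
	close(fd);
	return ret;
}
```

FALLOC_FL_PUNCH_HOLE has to be combined with FALLOC_FL_KEEP_SIZE, so the file length is unchanged and the punched range simply reads back as zeros.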

Ah. Ok. I wasn't aware that vmtruncate was deprecated. Thanks for cluing me in here!

This probably won't perform wonderfully, which is where the range
tracking and delayed punching (and the implied memory freeing)
optimisation comes into play. Sure, for tmpfs this can be implemented
as a shrinker, but for real filesystems that have to punch blocks a
shrinker is really the wrong context to be running such
transactions. However, using the fallocate() interface allows each
filesystem to optimise the delayed hole punching as they see best,
something that cannot be done with this fadvise() interface.
So if a shrinker isn't the right context, what would be a good
context for delayed hole punching?
Like we do in XFS for inode reclaim. We have a background workqueue
that frees aged inodes periodically in the fastest manner possible
(i.e. all async, no blocking on locks, etc), and the shrinker, when
run, kicks that background thread first, and then enters into
synchronous reclaim. By the time a single sync reclaim cycle has run
and throttled reclaim sufficiently, the background thread has done a
great deal more work.

A similar mechanism can be used for this functionality within XFS.
Indeed, we could efficiently track which inodes have volatile ranges
on them via a bit in the radix trees that index the inode cache,
just like we do for reclaimable inodes. If we then used a bit in the
page cache radix tree index to indicate volatile pages, we could
then easily find the ranges we need to punch out without requiring
some new tree and more per-inode memory.
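The range lookup described above amounts to walking the tagged page indices and coalescing neighbours into extents to punch. A minimal user-space sketch of that coalescing step follows. A flat byte array stands in for the tagged page cache radix tree, and struct punch_range and collect_volatile_ranges() are hypothetical names, not kernel or XFS interfaces.

```c
#include <assert.h>
#include <stddef.h>

/* One contiguous run of volatile pages, both ends inclusive. */
struct punch_range {
	size_t first;
	size_t last;
};

/*
 * tags[i] != 0 means page index i carries the (hypothetical) volatile tag.
 * Writes at most max ranges to out and returns how many were written.
 */
static size_t collect_volatile_ranges(const unsigned char *tags, size_t npages,
				      struct punch_range *out, size_t max)
{
	size_t n = 0;
	size_t i = 0;

	assert(tags != NULL && out != NULL);
	while (i < npages && n < max) {
		size_t first;

		if (!tags[i]) {
			i++;
			continue;
		}
		/* Start of a run: extend it while the tag stays set. */
		first = i;
		while (i < npages && tags[i])
			i++;
		out[n].first = first;
		out[n].last = i - 1;
		n++;
	}
	return n;
}
```

A real implementation would use a tagged radix tree lookup rather than a linear scan, so the cost would scale with the number of tagged pages instead of the size of the file.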

That's a very filesystem specific implementation - it's vastly
different to your tmpfs implementation - but this is exactly what I
mean about using fallocate to allow filesystems to optimise the
implementation in the most suitable manner for them....


So, just to make sure I'm following you, you're suggesting that there would be a filesystem-specific implementation at the top level. Something like a mark_volatile(struct inode *, bool, loff_t, loff_t) inode operation? And the filesystem would then be responsible for managing the ranges and appropriately purging them?
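To make the question concrete, here is a rough user-space mock of how such a per-filesystem hook might be dispatched. Everything in it is hypothetical: the sketch_ types stand in for the kernel's struct inode and struct inode_operations, toyfs is an invented filesystem that tracks a single range, and no such operation exists in the kernel today.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

typedef long long sketch_loff_t;	/* stand-in for the kernel's loff_t */

struct sketch_inode;

/* Hypothetical op table with the signature floated above. */
struct sketch_inode_operations {
	int (*mark_volatile)(struct sketch_inode *inode, bool is_volatile,
			     sketch_loff_t start, sketch_loff_t end);
};

struct sketch_inode {
	const struct sketch_inode_operations *i_op;
	/* Toy per-filesystem state: a single tracked byte range. */
	bool has_range;
	sketch_loff_t start;
	sketch_loff_t end;
};

/* Toy filesystem: remembers one range, forgets it when fully unmarked. */
static int toyfs_mark_volatile(struct sketch_inode *inode, bool is_volatile,
			       sketch_loff_t start, sketch_loff_t end)
{
	if (start > end)
		return -EINVAL;
	if (is_volatile) {
		inode->has_range = true;
		inode->start = start;
		inode->end = end;
	} else if (inode->has_range &&
		   start <= inode->start && end >= inode->end) {
		inode->has_range = false;
	}
	return 0;
}

static const struct sketch_inode_operations toyfs_iops = {
	.mark_volatile = toyfs_mark_volatile,
};

/* Generic entry point: dispatch to the filesystem or report no support. */
static int sketch_mark_volatile(struct sketch_inode *inode, bool is_volatile,
				sketch_loff_t start, sketch_loff_t end)
{
	if (!inode->i_op || !inode->i_op->mark_volatile)
		return -EOPNOTSUPP;
	return inode->i_op->mark_volatile(inode, is_volatile, start, end);
}
```

In this shape the generic code only validates and dispatches, and how ranges are stored and when they are purged is left entirely to the filesystem.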

Thanks again for the feedback, I'll continue looking into this.

thanks
-john



