Re: [PATCH 0/3] Volatile Ranges (v7) & Lots of words

From: John Stultz
Date: Tue Oct 02 2012 - 18:39:13 EST


On 10/02/2012 12:39 AM, NeilBrown wrote:
> On Fri, 28 Sep 2012 23:16:30 -0400 John Stultz <john.stultz@xxxxxxxxxx> wrote:
>
>> After Kernel Summit and Plumbers, I wanted to consider all the various
>> side-discussions and try to summarize my current thoughts here along
>> with sending out my current implementation for review.
>>
>> Also: I'm going on four weeks of paternity leave in the very near
>> (but non-deterministic) future. So while I hope I still have time
>> for some discussion, I may have to deal with fussier complaints
>> than yours. :) In any case, you'll have more time to chew on
>> the idea and come up with amazing suggestions. :)
>
> I wonder if you are trying to please everyone and risking pleasing no-one?
> Well, maybe not quite that extreme, but you can't please all the people all
> the time.

So while I do agree that I won't be able to please everyone, especially when it comes to how this interface is implemented internally, I do want to make sure that the userland interface really does make sense and isn't limited by my own short-sightedness. :)

> For example, allowing sub-page volatile region seems to be above and beyond
> the call of duty. You cannot mmap sub-pages, so why should they be volatile?

Although if someone marked a page and a half as volatile, would it be reasonable to throw away the second half of that second page? That seems unexpected to me. So we're really only marking the whole pages specified as volatile, similar to how FALLOC_FL_PUNCH_HOLE behaves.

But if it happens that the adjacent range is also a partial page, we can possibly coalesce them into a purgeable whole page. I think it makes sense, especially from a userland point of view, and it wasn't really complicated to add.
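
To make the page-rounding behavior concrete, here is a minimal userspace-style sketch of the inward rounding I have in mind; the helper name and signature are just for illustration, not code from the patch:

#include <stdint.h>
#include <unistd.h>

/*
 * Illustrative only: round a byte range inward to whole pages, so we
 * never treat non-volatile data as purgeable. Partial head/tail pages
 * are left alone (they may later coalesce with an adjacent range).
 */
static void whole_page_subrange(uint64_t start, uint64_t end,
                                uint64_t *pstart, uint64_t *pend)
{
        uint64_t pagesize = (uint64_t)sysconf(_SC_PAGESIZE);

        *pstart = (start + pagesize - 1) & ~(pagesize - 1);   /* round up   */
        *pend = end & ~(pagesize - 1);                         /* round down */

        if (*pend < *pstart)    /* smaller than a page: nothing purgeable */
                *pend = *pstart;
}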

> Similarly the suggestion of using madvise - while tempting - is probably a
> minority interest and can probably be managed with library code. I'm glad
> you haven't pursued it.

For now I see this as a lower priority, but it's something I'd like to investigate, since depending on tmpfs has issues: there's no quota support, so having a user-writable tmpfs partition mounted is a DoS opening, especially on low-memory systems.

> I think discarding whole ranges at a time is very sensible, and so merging
> adjacent ranges is best avoided. If you require page-aligned ranges this
> becomes trivial - is that right?

True. If we avoid coalescing non-whole page ranges, keeping non-overlapping ranges independent is fairly easy.

But it is also easy to avoid coalescing in all cases except where multiple sub-page ranges can be coalesced together into whole pages.

In other words, we mark the whole-page portion of the range as volatile, and keep the sub-page portions separate. So a non-page-aligned range could consist of three independent ranges, with the middle one being the only one marked volatile. Should those non-whole-page edge ranges be adjacent to other non-whole-page ranges, they could be coalesced. Since the coalesced edge ranges would be marked volatile after the full range, we would also avoid purging an edge page, which would otherwise invalidate two unpurged ranges.
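
To make that concrete, a hypothetical illustration (the struct and the numbers are made up for the example, not taken from the patch): with 4K pages, marking bytes [1024, 10240) volatile could be tracked as three ranges, and only the middle, whole-page one is actually eligible for purging:

#include <stdint.h>

struct range_example {
        uint64_t start, end;
        int purgeable;
};

/* [1024,  4096)  head fragment, kept non-volatile (may coalesce left)  */
/* [4096,  8192)  whole pages, marked volatile / purgeable              */
/* [8192, 10240)  tail fragment, kept non-volatile (may coalesce right) */
static const struct range_example example[] = {
        { 1024,  4096, 0 },
        { 4096,  8192, 1 },
        { 8192, 10240, 0 },
};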

Alternatively, we can never coalesce and only mark whole pages in single ranges as volatile. It doesn't really make it more complex.

But again, these are implementation details.

The main point is I think at the user-interface level, allowing userland to provide non-page aligned ranges is valid. What we do with those non-page aligned chunks is up to the kernel/implementation, but I think we should be conservative and be sure never to purge non-volatile data.

> I wonder if the oldest page/oldest range issue can be defined away by
> requiring apps to touch the first page in a range when they touch the range.
> Then the age of a range is the age of the first page. Non-initial pages
> could even be kept off the free list .... though that might confuse NUMA
> page reclaim if a range had pages from different nodes.

Not sure I followed this. Are you suggesting keeping non-initial ranges off the vmscan LRU lists entirely?

Another approach that was suggested, which sounds similar, is touching all the pages when we mark them as volatile, so they are all close to each other in the active/inactive list. Then when the vmscan shrink_lru_list() code runs it would purge the pages together (although it might only purge half a range if there wasn't the need for more memory). But again, these page-based solutions have much higher algorithmic complexity (O(n) with respect to pages marked) and overhead.
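
Just to illustrate the O(n) cost I mean, here is a rough kernel-side sketch of the per-page walk a page-based approach implies when a range is marked; it uses existing pagecache helpers purely for illustration and is not code from the patch:

#include <linux/mm.h>
#include <linux/pagemap.h>
#include <linux/swap.h>

/*
 * Illustrative only: the O(n)-in-pages walk that touching every page
 * in a marked range would require, which the range-based approach avoids.
 */
static void touch_range_pages(struct address_space *mapping,
                              pgoff_t start, pgoff_t end)
{
        pgoff_t index;

        for (index = start; index < end; index++) {
                struct page *page = find_get_page(mapping, index);

                if (!page)
                        continue;
                mark_page_accessed(page);   /* pull it toward the LRU head */
                put_page(page);
        }
}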


> Application to non-tmpfs files seems very unclear and so probably best
> avoided.
>
> If I understand you correctly, then you have suggested both that a volatile
> range would be a "lazy hole punch" and a "don't let this get written to disk
> yet" flag. It cannot really be both. The former sounds like fallocate,
> the latter like fadvise.

I don't think I see the exclusivity aspect. If we say "Dear kernel, you may punch a hole at this offset in this file whenever you want in the future" and then later say "Cancel my earlier hole-punching request" (to which the kernel can reply "Sorry, too late"), the semantics are very close to what I'm describing with the abstract interface to volatile ranges. Maybe the only subtlety with the hole-punching oriented worldview is that the kernel is smart enough not to bother writing out any data that could be punched out in the future.

But maybe this is a sufficient subtlety to still warrant avoiding it.

> I think the latter sounds more like the general purpose of volatile ranges,
> but I also suspect that some journalling filesystems might be uncomfortable
> providing a guarantee like that. So I would suggest firmly stating that it
> is a tmpfs-only feature. If someone wants something vaguely similar for
> other filesystems, let them implement it separately.

I mostly agree, as I don't have the context to see how this could be useful to other filesystems. So I'm limiting my functionality to tmpfs. However, DaveC saw some value in allowing it to be extended to other filesystems, and I'm not opposed to seeing the same interface be used if the semantics are close enough.

From Dave's earlier mail:

"Managing large scale disk caches have exactly the same problems of
determining what to evict and/or move to secondary storage when
space is low. Being able to mark ranges of files as "remove this
first" woulxp dbe very advantageous for pro-active mangement of ENOSPC
conditions in the cache...

And being able to do space-demand hole-punching for stuff like
managing VM images would be really cool. For example, the loopback
device uses hole punching to implement TRIM commands, so turning
those into VOLATILE ranges for background dispatch will speed those
operations up immensely and avoid silly same block "TRIM - write -
TRIM - write" cyclic hole-punch/allocation in the backing file. KVM
could do the same for implementing TRIM on image based block
devices...

There's lots of things that can be done with a generic advisory,
asynchronous hole-punching interface."

Christoph also mentioned the concept would have some usefulness for persistent caches, and I think for xfsutils as well?

To me, it seems the dynamic is: fadvise is too wishy-washy for anything that deals with persistent data on disk. It's more about how the kernel memory management should handle file data. Whereas fallocate has stronger semantics for the behavior of what happens on disk. So if this is really a tmpfs-only feature, fadvise should be OK, but if it were ever to be useful for making actual changes to disk, fallocate would be better.

So just from that standpoint, fallocate might be a more flexible interface to use, since it's really all the same for tmpfs.
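
Just to illustrate what the fallocate-style usage could look like from userland (the flag names and return convention here are placeholders for whatever we'd settle on, not a claim about the final interface):

#define _GNU_SOURCE
#include <fcntl.h>

/* Placeholder flag values, for illustration only. */
#ifndef FALLOC_FL_MARK_VOLATILE
#define FALLOC_FL_MARK_VOLATILE   0x40
#define FALLOC_FL_UNMARK_VOLATILE 0x80
#endif

/* Tell the kernel a range of a tmpfs file may be purged under pressure. */
static int mark_volatile(int fd, off_t offset, off_t len)
{
        return fallocate(fd, FALLOC_FL_MARK_VOLATILE, offset, len);
}

/*
 * Unmark before reusing the data. Some convention (return value or
 * errno, TBD) would tell the caller the contents were already purged
 * and need to be regenerated.
 */
static int unmark_volatile(int fd, off_t offset, off_t len)
{
        return fallocate(fd, FALLOC_FL_UNMARK_VOLATILE, offset, len);
}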

But let me know if my read on things here is off.

> The SIGBUS interface could have some merit if it really reduces overhead. I
> worry about app bugs that could result from the non-deterministic
> behaviour. A range could get unmapped while it is in use and testing for
> the case of "get a SIGBUS half way through accessing something" would not
> be straightforward (SIGBUS on first step of access should be easy).
> I guess that is up to the app writer, but I have never liked anything about
> the signal interface and encouraging further use doesn't feel wise.

Initially I didn't like the idea, but I have warmed considerably to it, mainly due to the concern that the constant unmark/access/mark pattern would be too much overhead; having a lazy method will be much nicer for performance. But yes, that comes at the cost of the additional complexity of handling the signal, marking the faulted address range as non-volatile, restoring the data and continuing.

The use case for Mozilla is where there are compressed library files on disk, which are decompressed into memory to reduce the I/O. Then the entire in-memory library can be marked volatile, and will be re-fetched as needed. Basically this allows for filesystem-independent disk compression (and, more importantly for them, reduced I/O).
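
As a rough sketch of what the application side of that could look like (entirely illustrative: the unmark_volatile()/refill_region() helpers are hypothetical, and the usual caveats about doing real work in a signal handler apply):

#define _GNU_SOURCE
#include <signal.h>
#include <sys/types.h>

/* Illustrative globals describing one volatile mapping. */
static char  *region_base;   /* start of the mmap'ed volatile region */
static size_t region_len;
static int    region_fd;

/* Hypothetical helpers, not real APIs: */
extern int  unmark_volatile(int fd, off_t off, off_t len);
extern void refill_region(char *base, size_t len);   /* e.g. re-decompress */

static void sigbus_handler(int sig, siginfo_t *info, void *ctx)
{
        char *addr = (char *)info->si_addr;

        if (addr >= region_base && addr < region_base + region_len) {
                /*
                 * Fault hit a purged volatile range: make it non-volatile
                 * again, restore the contents, and return so the faulting
                 * access is retried.
                 */
                unmark_volatile(region_fd, 0, region_len);
                refill_region(region_base, region_len);
                return;
        }
        /* Not ours: fall back to the default behaviour. */
        signal(SIGBUS, SIG_DFL);
        raise(sig);
}

static void install_sigbus_handler(void)
{
        struct sigaction sa = { 0 };

        sa.sa_sigaction = sigbus_handler;
        sa.sa_flags = SA_SIGINFO;
        sigaction(SIGBUS, &sa, NULL);
}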

Hopefully that provides some extra context. Thanks again for the words of wisdom here. I do agree that at a certain point I will have to become less flexible, in order to push something upstream, but since interest in this work has been somewhat sporadic, I do want to make sure folks have at least a chance to "bend the sapling" this one last time. :)

thanks
-john
