[PATCH 0/3] Volatile Ranges (v7) & Lots of words

From: John Stultz
Date: Fri Sep 28 2012 - 23:17:44 EST



After Kernel Summit and Plumbers, I wanted to consider all the various
side-discussions and try to summarize my current thoughts here along
with sending out my current implementation for review.

Also: I'm going on four weeks of paternity leave in the very near
(but non-deterministic) future. So while I hope I still have time
for some discussion, I may have to deal with fussier complaints
then yours. :) In any case, you'll have more time to chew on
the idea and come up with amazing suggestions. :)


General Interface semantics:
----------------------------------------------

The high level interface I've been pushing has so far stayed fairly
consistent:

Application marks a range of data as volatile. Volatile data may
be purged at any time. Accessing volatile data is undefined, so
applications should not do so. If the application wants to access
data in a volatile range, it should mark it as non-volatile. If any
of the pages in the range being marked non-volatile had been purged,
the kernel will return an error, notifying the application that the
data was lost.

But one interesting new tweak on this design, suggested by the Taras
Glek and others at Mozilla, is as follows:

Instead of leaving volatile data access as being undefined , when
accessing volatile data, either the data expected will be returned
if it has not been purged, or the application will get a SIGBUS when
it accesses volatile data that has been purged.

Everything else remains the same (error on marking non-volatile
if data was purged, etc). This model allows applications to avoid
having to unmark volatile data when it wants to access it, then
immediately re-mark it as volatile when its done. It is in effect
"lazy" with its marking, allowing the kernel to hit it with a signal
when it gets unlucky and touches purged data. From the signal handler,
the application can note the address it faulted on, unmark the range,
and regenerate the needed data before returning to execution.

Since this approach avoids the more explicit unmark/access/mark
pattern, it avoids the extra overhead required to ensure data is
non-volatile before being accessed.

However, If applications don't want to deal with handling the
sigbus, they can use the more straightforward (but more costly)
unmark/access/mark pattern in the same way as my earlier proposals.

This allows folks to balance the cost vs complexity in their
application appropriately.

So that's a general overview of how the idea I'm proposing could
be used.



Specific Interface semantics:
---------------------------------------------

Here are some of the open question about how the user interface
should look:

fadvise vs fallocate:

So while originally I used fadvise, currently my
implementation uses fallocate(fd, FALLOC_FL_MARK_VOLATILE,
start, len) to mark a range as volatile and fallocate(fd,
FALLOC_FL_UNMARK_VOLATILE, start, len) to unmark ranges.

During kernel summit, the question was brought up if fallocate
was really the right interface to be using, and if fadvise
would be better. To me fadvise makes a little more sense,
but earlier it was pointed out that marking data ranges as
volatile could also be seen as a type of cancellable and lazy
hole-punching, so from that perspective fallocate might make
more sense. This is still an open question and I'd appreciate
further input here.


tmpfs vs non-shmem filesystems:
Android's ashmem primarily provides a way to get unlinked
tmpfs fds that can be shared between applications. Its
just an additional feature that those pages can "unpinned"
or marked volatile in my terminology. Thus in implementing
volatile ranges, I've focused on getting it to work on tmpfs
file descriptors. However, there has been some interest in
using volatile ranges with more traditional filesystems. The
semantics for how volatile range purging would work on a
real filesystem are not well established, and I can't say I
understand the utility quite yet, but there may be a case for
having data that you know won't be committed to disk until it
is marked as non-volatile. However, returning an EINVAL on
non-tmpfs filesystems until such a use is established should
be fine.

fd based interfaces vs madvise:
In talking with Taras Glek, he pointed out that for his
needs, the fd based interface is a little annoying, as it
requires having to get access to tmpfs file and mmap it in,
then instead of just referencing a pointer to the data he
wants to mark volatile, he has to calculate the offset from
start of the mmap and pass those file offsets to the interface.
Instead he mentioned that using something like madvise would be
much nicer, since they could just pass a pointer to the object
in memory they want to make volatile and avoid the extra work.

I'm not opposed to adding an madvise interface for this as
well, but since we have a existing use case with Android's
ashmem, I want to make sure we support this existing behavior.
Specifically as with ashmem applications can be sharing
these tmpfs fds, and so file-relative volatile ranges make
more sense if you need to coordinate what data is volatile
between two applications.

Also, while I agree that having an madvise interface for
volatile ranges would be nice, it does open up some more
complex implementation issues, since with files, there is a
fixed relationship between pages and the files' address_space
mapping, where you can't have pages shared between different
mappings. This makes it easy to hang the volatile-range tree
off of the mapping (well, indirectly via a hash table). With
general anonymous memory, pages can be shared between multiple
processes, and as far as I understand, don't have any grouping
structure we could use to determine if the page is in a
volatile range or not. We would also need to determine more
complex questions like: What are the semantics of volatility
with copy-on-write pages? I'm hoping to investigate this
idea more deeply soon so I can be sure whatever is pushed has
a clear plan of how to address this idea. Further thoughts
here would be appreciated.


It would really be great to get any thoughts on these issues, as they
are higher-priority to me then diving into the details of how we
implement this internally, which can shift over time.



Implementation Considerations:
---------------------------------------------

How best to manage volatile ranges internally in the kernel is still
an open question.

With this patch set, I'm really wanting to provide a proof of concept
of the general interface semantics above. This allows applications to
play with the idea and validate that it would work for them. Allowing
further discussion to continue on how to best implement or best allow
the implementation to evolve in the kernel.

Even so, I'm very interested in any discussion about how to manage
volatile ranges optimally.

Before describing the different management approaches I've tried,
there are some abstract properties and considerations that need to
be kept in mind:

* Range-granular Purging:
Since volatile ranges can be reclaimed at any time, the
question of how the kernel should reclaim volatile data
needs to be addressed. When a large data range is marked
as volatile, if any single page in that range is purged,
the application will get an error when it marks the range
as non-volatile. Thus when any single page in a range
is purged, the "value" of the entire range is destroyed.
Because of this property, it makes most sense to purge the
entire range together.


* Coalescing of adjacent volatile ranges:
With volatile ranges, any overlapping ranges are always
coalesced. However, there is an open question of what to
do with neighboring ranges. With Android's approach, any
neighboring ranges were coalesced into one range. I've since
tweaked this so that adjacent ranges are coalesced only if
both have not yet been purged (or both are already purged).
This avoids throwing away fine data just because its next
to data that has already been tossed. Not coalescing
non-overlapping ranges is also an option I've considered,
as it better follows the applications wishes, since as
the application is providing these non-overlapping ranges
separately, we should probably also purge them separately.
The one complication here is that for userlands-sake, we
manage volatile ranges at a byte level. So if an application
marks one an a half pages of data as volatile, we only purge
pages that are entirely volatile. This avoids accidentally
purging non-volatile data on the rest of the page. However,
if an array of sub-page sized data is marked volatile one by
one, coalescing the ranges allows us to purge a page that
consists entirely of multiple volatile ranges. So for now
I'm still coalescing assuming the neighbors are both unpurged,
but this behavior is open to being tweaked.


* Purging order between volatile ranges:
Again, since it makes sense to purge all the complete
pages in a range at the same time, we need to consider the
subtle difference between the least-recently-used pages vs
least-recently-used ranges. A single range could contain very
frequently accessed data, as well as rarely accessed data.
One must also consider that the act of marking a range as
volatile may not actually touch the underlying pages. Thus
purging ranges based on a least-recently-used page may also
result in purging the most-recently used page.

Android addressed the purging order question by purging ranges
in the order they were marked volatile. Thus the oldest
volatile range is the first range to be purged. This works
well in the Android model, as applications aren't supposed
to access volatile data, so the least-recently-marked-volatile
order maps well to the least-recently-used-range.

However, this assumption doesn't hold with the lazy SIGBUS
notification method, as pages in a volatile range may continue
to be accessed after the range is marked volatile. So the
question as to what is the best order of purging volatile
ranges is definitely open.

Abstractly the ideal solution might be to evaluate the
most-recently used page in each range, and to purge the range
with the oldest recently-used-page, but I suspect this is
not something that could be calculated efficiently.

Additionally, in my conversations with Taras, he pointed out
that if we are using a one-application-at-a-time UI model,
it would be ideal to discourage purging volatile data used by
the current application, instead prioritizing volatile ranges
from applications that aren't active. However, I'm not sure
what mechanism could be used to prioritize range purging in
this fashion, especially considering volatile ranges can be
on data that is shared between applications.


* Volatile range purging order relative to non-volatile pages:
Initially I had proposed that since applications had offered
data up as unused, volatile ranges should be purged before we
try to free any other pages in the system. At Plumbers, Andrea
pointed out that this doesn't make much sense, as there may be
inactive file pages from some streaming file data which are not
going to be used any time soon, and would be a better candidate
to free then an application's volatile pages. This sounded
quite reasonable, so its likely we need to balance volatile
purging with freeing other pages in the system. However, I do
think it is advantageous to purge volatile pages before we
free any dirty pages that must be laundered, as part of the
goal of volatile pages is to avoid extra io. Although from
my reading of shrink_page_list in vmscan.c I'm not sure I see
if/how we prioritize freeing clean pages prior to dirty ones.


So with that background covered, on to discussing actual
implementations.

Implementation Details:
---------------------------------------------

There is two rough approaches that I have tried so far

1) Managing volatile range objects, in a tree or list, which are then
purged using a shrinker

2) Page based management, where pages marked volatile are moved to
a new LRU list and are purged from there.



1) This patchset is of the the shrinker-based approach. In many ways it
is simpler, but it does have a few drawbacks. Basically when marking a
range as volatile, we create a range object, and add it to an rbtree.
This allows us to be able to quickly find ranges, given an address in
the file. We also add each range object to the tail of a filesystem
global linked list, which acts as an LRU allowing us to quickly find
the least recently created volatile range. We then use a shrinker
callback to trigger purging, where we'll select the range on the head
of the LRU list, purge the data, mark the range object as purged,
and remove it from the lru list.

This allows fairly efficient behavior, as marking and unmarking
a range are both O(logn) operation with respect to the number of
ranges, to insert and remove from the tree. Purging the range is
also O(1) to select the range, and we purge the entire range in
least-recently-marked-volatile order.

The drawbacks with this approach is that it uses a shrinker, thus it is
numa un-aware. We track the virtual address of the pages in the file,
so we don't have a sense of what physical pages we're using, nor on
which node those pages may be on. So its possible on a multi-node
system that when one node was under pressure, we'd purge volatile
ranges that are all on a different node, in effect throwing data away
without helping anything. This is clearly non-ideal for numa systems.

One idea I discussed with Michel Lespinasse is that this might be
something we could improve by providing the shrinker some node context,
then keep track in the range what node their first page is on. That
way we would be sure to at least free up one page on the node under
pressure when purging that range.


2) The second approach, which was more page based, was also tried. In
this case when we marked a range as volatile, the pages in that range
were moved to a new lru list LRU _VOLATILE in vmscan.c. This provided
a page lru list that could be used to free pages before looking at
the LRU_INACTIVE_FILE/ANONYMOUS lists.

This integrates the feature deeper in the mm code, which is nice,
especially as we have an LRU_VOLATILE list for each numa node. Thus
under pressure we won't purge ranges that are entirely on a different
node, as is possible with the other approach.

However, this approach is more costly. When marking a range
as volatile, we have to migrate every page in that range to the
LRU_VOLATILE list, and similarly on unmarking we have to move each
page back. This ends up being O(n) with respect to the number of
pages in the range we're marking or unmarking. Similarly when purging,
we let the scanning code select a page off the lru, then we have to
map it back to the volatile range so we can purge the entire range,
making it a more expensive O(logn), with respect to the number of
ranges, operation.

This is a particular concern as applications that want to mark and
unmark data as volatile with fine granularity will likely be calling
these operations frequently, adding quite a bit of overhead. This
makes it less likely that applications will choose to volunteer data
as volatile to the system.

However, with the new lazy SIGBUS notification, applications using
the SIGBUS method would avoid having to mark and unmark data when
accessing it, so this overhead may be less of a concern. However, for
cases where applications don't want to deal with the SIGBUS and would
rather have the more deterministic behavior of the unmark/access/mark
pattern, the performance is a concern.

Additionally, there may be ways to defer and batch the page migration
so that applications don't suffer the extra cost, but this solution
may be limited or could cause some strange behavior, as we can't
defer the unmark method, as we don't want pages to be purged after
the application thinks they were unmarked.


Whew, that was long...

Anyway, if you got this far and are still interested, I'd be greatly
appreciate hearing of any other suggested implementations, or ways
around the drawbacks of the already tried approaches.

thanks
-john


For this v7 patchset revision the changes are as follows:
* Dropped the LRU_VOLATILE approach for now so we can focus on
getting the general interface semantics agreed upon
* Converted to using byte ranges rather then page ranges to make
userland's life easier.
* Add SIGBUS on purged page access behavior, allowing for access
of volatile data without having to unmark it.


Cc: Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx>
Cc: Android Kernel Team <kernel-team@xxxxxxxxxxx>
Cc: Robert Love <rlove@xxxxxxxxxx>
Cc: Mel Gorman <mel@xxxxxxxxx>
Cc: Hugh Dickins <hughd@xxxxxxxxxx>
Cc: Dave Hansen <dave@xxxxxxxxxxxxxxxxxx>
Cc: Rik van Riel <riel@xxxxxxxxxx>
Cc: Dmitry Adamushko <dmitry.adamushko@xxxxxxxxx>
Cc: Dave Chinner <david@xxxxxxxxxxxxx>
Cc: Neil Brown <neilb@xxxxxxx>
Cc: Andrea Righi <andrea@xxxxxxxxxxxxxxx>
Cc: Aneesh Kumar K.V <aneesh.kumar@xxxxxxxxxxxxxxxxxx>
Cc: Mike Hommey <mh@xxxxxxxxxxxx>
Cc: Taras Glek <tglek@xxxxxxxxxxx>
Cc: Jan Kara <jack@xxxxxxx>
Cc: KOSAKI Motohiro <kosaki.motohiro@xxxxxxxxx>
Cc: Michel Lespinasse <walken@xxxxxxxxxx>
Cc: Minchan Kim <minchan@xxxxxxxxxx>
Cc: linux-mm@xxxxxxxxx <linux-mm@xxxxxxxxx>


John Stultz (3):
[RFC] Add volatile range management code
[RFC] tmpfs: Add FALLOC_FL_MARK_VOLATILE/UNMARK_VOLATILE handlers
[RFC] ashmem: Convert ashmem to use volatile ranges

drivers/staging/android/ashmem.c | 335 +---------------------
fs/open.c | 3 +-
include/linux/falloc.h | 7 +-
include/linux/volatile.h | 46 +++
mm/Makefile | 2 +-
mm/shmem.c | 120 ++++++++
mm/volatile.c | 580 ++++++++++++++++++++++++++++++++++++++
7 files changed, 763 insertions(+), 330 deletions(-)
create mode 100644 include/linux/volatile.h
create mode 100644 mm/volatile.c

--
1.7.9.5

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/