On Wed, Mar 20, 2013 at 06:29:38PM -0700, John Stultz wrote:
On 03/12/2013 12:38 AM, Minchan Kim wrote:
First of all, let's define the term.
From now on, I'd like to call it vrange (a.k.a. volatile range)
for anonymous pages. If you have a better name in mind, please suggest it.
This version is still *RFC* because it's just a quick prototype, so
it doesn't support THP/HugeTLB/KSM and can't even build on !x86.
Before sorting out further issues, I'd like to post the current direction
and discuss it. Of course, I'd like to extend this discussion at the
upcoming LSF/MM.
In this version, I changed lots of things; in particular, I removed the
vma-based approach because it needs the write-side lock for mmap_sem,
which will hurt performance on multi-threaded big SMP systems, as KOSAKI
pointed out. The vma-based approach also makes it hard to meet the
requirement of the new system call: the semantic John Stultz suggested
for consistent purged handling.
(http://linux-kernel.2935.n7.nabble.com/RFC-v5-0-8-Support-volatile-for-anonymous-range-tt575773.html#none)
I tested this patchset with a modified jemalloc allocator. The port was
led by Jason Evans (jemalloc author), who was interested in this feature
and was happy to port his allocator to use the new system call.
Super thanks, Jason!
The benchmark for the test is ebizzy. It has been used for testing
allocator performance, so it's good for me. Again, thanks for recommending
the benchmark, Jason.
(http://people.freebsd.org/~kris/scaling/ebizzy.html)
The result is good on my machine (12 CPU, 1.2GHz, DRAM 2G)
ebizzy -S 20
jemalloc-vanilla: 52389 records/sec
jemalloc-vrange: 203414 records/sec
ebizzy -S 20 with background memory pressure
jemalloc-vanilla: 40746 records/sec
jemalloc-vrange: 174910 records/sec
And it's much improved on a KVM virtual machine.
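As a reading aid for the ebizzy figures above (the 12-CPU machine), the gain of jemalloc-vrange over jemalloc-vanilla is simply the ratio of the two records/sec values. The helper name `speedup` below is introduced only for illustration and is not part of the patchset.

```c
#include <assert.h>

/* Ratio of two throughput figures given in records/sec. */
static double speedup(double with_vrange, double vanilla)
{
    assert(vanilla > 0.0);
    return with_vrange / vanilla;
}
```

With the numbers posted above, `speedup(203414, 52389)` comes out at roughly 3.9, and `speedup(174910, 40746)` under background memory pressure at roughly 4.3.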
This patchset is based on v3.9-rc2
- What's the sys_vrange(addr, length, mode, behavior)?
It's a hint that the user delivers to the kernel so the kernel can
*discard* pages in a range anytime. mode is one of VRANGE_VOLATILE and
VRANGE_NOVOLATILE. VRANGE_NOVOLATILE is a memory pin operation, so the
kernel can no longer discard any pages, while VRANGE_VOLATILE is a
memory unpin operation, so the kernel can discard pages in the vrange
anytime. At the moment, behavior is one of VRANGE_FULL and
VRANGE_PARTIAL. VRANGE_FULL tells the kernel that once it decides to
discard a page in a vrange, it should discard all of the pages in the
vrange selected as the victim. VRANGE_PARTIAL tells the kernel to
discard only some of the pages in a vrange. But I haven't implemented
VRANGE_PARTIAL handling yet.
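The mode/behavior semantics described above can be pictured with a small userspace model. This is only a sketch under stated assumptions: the enum values, `struct vrange_model`, and the `model_*` functions are all hypothetical names, the real constants live in the patchset, and the VRANGE_PARTIAL branch is a guess at the intended behavior since it is not implemented yet.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Placeholder values: the real constants come from the patchset headers. */
enum vrange_mode { VRANGE_VOLATILE, VRANGE_NOVOLATILE };
enum vrange_behavior { VRANGE_FULL, VRANGE_PARTIAL };

#define MODEL_PAGES 8 /* pages in the modelled range (arbitrary) */

struct vrange_model {
    enum vrange_mode mode;
    enum vrange_behavior behavior;
    bool present[MODEL_PAGES]; /* true while the page still holds data */
};

/* Start pinned (non-volatile) with every page present. */
static void model_init(struct vrange_model *r)
{
    r->mode = VRANGE_NOVOLATILE;
    r->behavior = VRANGE_FULL;
    for (size_t i = 0; i < MODEL_PAGES; i++)
        r->present[i] = true;
}

/* Stand-in for sys_vrange(addr, length, mode, behavior) on the whole range. */
static void model_vrange(struct vrange_model *r, enum vrange_mode mode,
                         enum vrange_behavior behavior)
{
    r->mode = mode;
    r->behavior = behavior;
}

/* Simulated reclaim that wants `want` pages; returns how many were discarded. */
static size_t model_reclaim(struct vrange_model *r, size_t want)
{
    size_t discarded = 0;

    assert(r != NULL);
    if (r->mode != VRANGE_VOLATILE)
        return 0; /* pinned: nothing may be discarded */

    /* FULL drops the whole range; PARTIAL (assumed) stops at `want`. */
    size_t limit = (r->behavior == VRANGE_FULL) ? MODEL_PAGES : want;
    for (size_t i = 0; i < MODEL_PAGES && discarded < limit; i++) {
        if (r->present[i]) {
            r->present[i] = false;
            discarded++;
        }
    }
    return discarded;
}
```

The model deliberately leaves out what happens to already-discarded pages when a range is switched back to VRANGE_NOVOLATILE, because the description above does not specify it.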
So I'm very excited to see this new revision! Moving away from the
VMA based approach I think is really necessary, since managing the
volatile ranges on a per-mm basis really isn't going to work when we
want shared volatile ranges between processes (such as the
shmem/tmpfs case Android uses).
Just a few questions and observations from my initial playing around
with the patch:
1) So, I'm not sure I understand the benefit of VRANGE_PARTIAL. Why
would VRANGE_PARTIAL be useful?

For example, some process makes 64M of vranges and now the kernel needs
8M of pages to escape from the memory pressure state. In this case, we
don't need to discard all 64M at once: if we discard only 8M of pages,
the cost to the allocator is
(8M/4K) * page(fault + allocation + zero-clearing),
whereas it is (64M/4K) * page(fault + allocation + zero-clearing)
otherwise.
If it were a temporary image extracted from some compressed format, it's
not easy to regenerate punched-hole data from the original source, so it
would be better to discard all pages in the vrange, which will be very
far from the memory reclaimer.
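To put numbers on the cost argument above, the count of pages the allocator has to fault back in is the discarded size divided by the page size. The sketch below assumes 4K pages as in the text; `refault_pages` is a hypothetical helper name used only for this illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Number of pages that must be faulted back in after discarding `bytes`. */
static size_t refault_pages(size_t bytes, size_t page_size)
{
    assert(page_size > 0);
    return bytes / page_size;
}
```

With 4K pages, discarding 8M costs 2048 rounds of fault + allocation + zero-clearing, while discarding the full 64M costs 16384, so the partial case pays one eighth of the refault cost.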
2) I've got a trivial test program that I've used previously with
ashmem & my earlier file based efforts that allocates 26megs of page
aligned memory, and marks every other meg as volatile. Then it forks
and the child generates a ton of memory pressure, causing pages to
be purged (and the child killed by the OOM killer). Initially I
didn't see my test purging any pages with your patches. The problem
of course was the child's COW pages were not also marked volatile,
so they could not be purged. Once I over-wrote the data in the
child, breaking the COW links, the data in the parent was purged
under pressure. This is good, because it makes sure we don't purge
cow pages if the volatility state isn't consistent, but it also
brings up a few questions:
- Should volatility be inherited on fork? If volatility is not
inherited on fork(), that could cause some strange behavior if the
data was purged prior to the fork, and also it's not clear what the
behavior of the child should be with regards to data that was
volatile at fork time. However, we also don't want strange behavior
on exec if overwritten volatile pages were unexpectedly purged.

I don't know why we should inherit volatility to the child, at least
for anon vrange, because that's not a proper way to share data.
To share data in anonymous pages, we should use shmem, so that work
could be done when we do the tmpfs work, I guess.
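The marking step of the test program described in 2) can be sketched as follows. This is not John's actual test: the callback type `mark_fn`, the stub `stub_mark_volatile`, and the choice to start with the first megabyte are all assumptions made so the sketch runs without the patched kernel.

```c
#include <assert.h>
#include <stddef.h>

#define MEG ((size_t)1 << 20)

/* Callback standing in for "mark [addr, addr+len) volatile"; 0 on success. */
typedef int (*mark_fn)(void *addr, size_t len);

/* Hypothetical stub used in place of the real system call. */
static size_t stub_marked_bytes;
static int stub_mark_volatile(void *addr, size_t len)
{
    (void)addr;
    stub_marked_bytes += len;
    return 0;
}

/* Mark megabytes 0, 2, 4, ... of buf volatile; returns how many succeeded. */
static size_t mark_every_other_meg(void *buf, size_t megs, mark_fn mark)
{
    unsigned char *base = buf;
    size_t marked = 0;

    assert(buf != NULL && mark != NULL);
    for (size_t i = 0; i < megs; i += 2)
        if (mark(base + i * MEG, MEG) == 0)
            marked++;
    return marked;
}
```

For the 26-meg buffer in the description this marks 13 one-meg chunks. The real test would then fork() and have the child generate memory pressure, which is left out here.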
4) One of the harder aspects I'm trying to get my head around is how
your patches seem to use both the page list shrinkers
(discard_vpage) to purge ranges when particular pages selected, and
a zone shrinker (discard_vrange_pages) which manages its own lru of
vranges. I get that this is one way to handle purging anonymous
pages when we are on a swapless system, but the dual purging systems
definitely make the code harder to follow. Would something like my
earlier attempts at changing vmscan to shrink anonymous pages be
simpler? Or is that just not going to fly w/ the mm folks?

discard_vpage is for avoiding swapping out in the direct reclaim path
when kswapd misses the page.
discard_vrange_pages is for handling volatile pages as top priority,
prior to reclaiming non-volatile pages.
I think it's very clear, NOT hard to understand. :)
And discard_vpage is the basic core function to discard a volatile page,
so it can be used in many places.

There were many attempts in the past. Could you point out which one you
mean?