Re: [RFC v4 0/3] Support volatile for anonymous range

From: Kamezawa Hiroyuki
Date: Tue Dec 25 2012 - 21:41:16 EST

Next message: Jaegeuk Kim: "Re: [PATCH review] f2fs: Don't assign e_id in f2fs_acl_from_disk"
Previous message: Cong Wang: "Re: [PATCH] fb: Rework locking to fix lock ordering on takeover"
Next in thread: Minchan Kim: "Re: [RFC v4 0/3] Support volatile for anonymous range"

(2012/12/18 15:47), Minchan Kim wrote:
> This is still an RFC because we need more input from user-space
> people and more discussion about the interface and reclaim policy of
> volatile pages. I also want to extend this concept to tmpfs volatile
> ranges if that is possible without a big performance drop for anonymous
> volatile ranges. (Let's define our terms: anon volatile vs. tmpfs volatile? John?)
>
> NOTE: I haven't considered THP/KSM, so you should disable them when testing.
>
> I hope for more input from user-space allocator people, and for tests
> of this patch with their allocators, because getting real value from it
> might require design changes to arena management.
>
> Changelog from v4
>
> * Add new system call mvolatile/mnovolatile
> * Add SIGBUS when the user tries to access a volatile range
> * Rebased on v3.7
> * Applied bug fix from John Stultz, Thanks!
>
> Changelog from v3
>
> * Remove madvise(addr, length, MADV_NOVOLATILE)
> * Add a vmstat counter for the number of discarded volatile pages
> * Discard volatile pages without promotion in the reclaim path
>
> This is based on v3.7
>
> - What is mvolatile(addr, length)?
>
> It's a hint that the user gives to the kernel so that the kernel can
> *discard* pages in the range at any time.
>

Does this work for both PRIVATE and SHARED mappings?

What happens at fork()? Are VOLATILE ranges copied?

> - What happens if the user accesses a page (i.e., a virtual address)
> that the kernel has discarded?
>
> The user can encounter SIGBUS.
>
> - What should the user do to avoid SIGBUS?
> He should call mnovolatile(addr, length) before accessing a range
> on which mvolatile was called.
>
Will mnovolatile() return whether the range was discarded or not?

What should the user do in the signal handler?
Can all the expected operations be done in a signal-safe manner?
(IOW, can the user do the necessary work easily without taking any locks in userland?)

> - What happens if the user accesses a page (i.e., a virtual address)
> that the kernel has not discarded?
>
> The user sees the old data without a page fault.
>

What happens when the user calls mvolatile() on an mlock()'d range, or
calls mlock() on an mvolatile()'d range?

Hm, by the way, does the user need to attach pages to the process by causing
page faults (as you do with memset()) before calling mvolatile()?

I think your approach is interesting, anyway.

Thanks,
-Kame

> - How does it differ from madvise(DONTNEED)?
>
> System call semantics
>
> DONTNEED guarantees that the user always sees zero-filled pages after
> calling madvise, while after mvolatile the user can see old data or
> encounter SIGBUS.
>
> Internal implementation
>
> madvise(DONTNEED) has to zap all mapped pages in the range, so its
> overhead grows linearly with the number of mapped pages.
> Moreover, if the user then writes to a zapped page, a page fault +
> page allocation + memset must happen.
>
> mvolatile just sets a flag on the range (i.e., the VMA) instead of
> zapping all the ptes in the VMA, so it doesn't touch the ptes at all.
>
> - What's the benefit compared to DONTNEED?
>
> 1. The system call overhead is very small because mvolatile just sets
> a flag on the VMA instead of zapping all the pages in the range.
>
> 2. It has a chance to eliminate overhead (e.g., zapping ptes + page fault
> + page allocation + memset(PAGE_SIZE)) if memory pressure isn't
> severe.
>
> 3. It has the potential to zap all the ptes and free the pages when memory
> pressure is severe, so the reclaim overhead could disappear - TODO
>
> - Are there any drawbacks?
>
> madvise(DONTNEED) doesn't need mmap_sem held exclusively, so concurrent
> page faults from other threads are allowed. But m[no]volatile needs
> exclusive mmap_sem, so other threads are blocked if they try to
> access not-yet-mapped pages. That's why I designed m[no]volatile so that
> its overhead is as small as possible.
>
> It could also increase max RSS, because madvise(DONTNEED) frees pages
> immediately when the system call is issued, while mvolatile delays that
> until memory pressure occurs. If the increased max RSS itself causes
> severe memory pressure, the system suffers. Either the allocator needs
> some balancing logic for that, or the kernel could zap the pages at
> mvolatile time when memory pressure is severe.
> The problem is how to know that memory pressure is severe.
> One solution is to check whether kswapd is active. Another is
> Anton's mempressure, so that the allocator can handle it.
>
> - What is it targeting?
>
> First, user-space allocators like ptmalloc and tcmalloc, or the heap
> management of virtual machines like Dalvik. It also comes in handy for
> embedded systems that have no swap device and so can't reclaim anonymous
> pages. Because pages are discarded instead of swapped out, it can be used
> on swapless systems. For that, we have to age the anon LRU list even
> without swap, because I don't want volatile pages to be discarded with top
> priority when memory pressure occurs: volatile in this patch means "We
> don't need to swap out because the user can handle data disappearing
> suddenly", NOT "They are useless, so hurry up and reclaim them". So I want
> to apply the same aging rules as for normal pages.
>
> Background aging of anonymous pages on a swapless system would be a
> trade-off for getting a good feature. We even did it until two years ago,
> when [1] was merged, and I believe the gain from this patch will outweigh
> the loss from anon LRU aging overhead once all allocators start to use
> madvise.
> (This patch doesn't include background aging for swapless systems,
> but it's trivial to add if we decide to.)
>
> As another choice, we can zap the range like madvise(DONTNEED) does when
> mvolatile is called if we don't have swap space.
>
> - Stupid performance test
> I attach a test program/script which are utter crap, and I don't expect
> that current smart allocators ever behave like this, so we need more
> practical data from real allocators.
>
> KVM - 8 core, 2G
>
> VOLATILE test
> 13.16user 7.58system 0:06.04elapsed 343%CPU (0avgtext+0avgdata 2624096maxresident)k
> 0inputs+0outputs (0major+164050minor)pagefaults 0swaps
>
> DONTNEED test
> 23.30user 228.92system 0:33.10elapsed 762%CPU (0avgtext+0avgdata 213088maxresident)k
> 0inputs+0outputs (0major+16384210minor)pagefaults 0swaps
>
> x86-64 - 12 core, 2G
>
> VOLATILE test
> 33.38user 0.44system 0:02.87elapsed 1178%CPU (0avgtext+0avgdata 3935008maxresident)k
> 0inputs+0outputs (0major+245989minor)pagefaults 0swaps
>
> DONTNEED test
> 28.02user 41.25system 0:05.80elapsed 1192%CPU (0avgtext+0avgdata 387776maxresident)k
>
> [1] 74e3f3c3, vmscan: prevent background aging of anon page in no swap system
>
> Any comments are welcome!
>
> Cc: Michael Kerrisk <mtk.manpages@xxxxxxxxx>
> Cc: Arun Sharma <asharma@xxxxxx>
> Cc: sanjay@xxxxxxxxxx
> Cc: Paul Turner <pjt@xxxxxxxxxx>
> CC: David Rientjes <rientjes@xxxxxxxxxx>
> Cc: John Stultz <john.stultz@xxxxxxxxxx>
> Cc: Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx>
> Cc: Christoph Lameter <cl@xxxxxxxxx>
> Cc: Android Kernel Team <kernel-team@xxxxxxxxxxx>
> Cc: Robert Love <rlove@xxxxxxxxxx>
> Cc: Mel Gorman <mel@xxxxxxxxx>
> Cc: Hugh Dickins <hughd@xxxxxxxxxx>
> Cc: Dave Hansen <dave@xxxxxxxxxxxxxxxxxx>
> Cc: Rik van Riel <riel@xxxxxxxxxx>
> Cc: Dave Chinner <david@xxxxxxxxxxxxx>
> Cc: Neil Brown <neilb@xxxxxxx>
> Cc: Mike Hommey <mh@xxxxxxxxxxxx>
> Cc: Taras Glek <tglek@xxxxxxxxxxx>
> Cc: KOSAKI Motohiro <kosaki.motohiro@xxxxxxxxx>
> Cc: Christoph Lameter <cl@xxxxxxxxx>
> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@xxxxxxxxxxxxxx>
>
> Minchan Kim (3):
> Introduce new system call mvolatile
> Discard volatile page
> add PGVOLATILE vmstat count
>
> arch/x86/syscalls/syscall_64.tbl | 3 +-
> include/linux/mm.h | 1 +
> include/linux/mm_types.h | 2 +
> include/linux/rmap.h | 3 +
> include/linux/syscalls.h | 2 +
> include/linux/vm_event_item.h | 2 +-
> mm/Makefile | 4 +-
> mm/huge_memory.c | 9 +-
> mm/ksm.c | 3 +-
> mm/memory.c | 2 +
> mm/migrate.c | 6 +-
> mm/mlock.c | 5 +-
> mm/mmap.c | 2 +-
> mm/mvolatile.c | 396 ++++++++++++++++++++++++++++++++++++++
> mm/rmap.c | 97 +++++++++-
> mm/vmscan.c | 4 +
> mm/vmstat.c | 1 +
> 17 files changed, 527 insertions(+), 15 deletions(-)
> create mode 100644 mm/mvolatile.c
>
> ================== 8< =============================
>
> #define _GNU_SOURCE
> #include <stdio.h>
> #include <pthread.h>
> #include <sched.h>
> #include <sys/mman.h>
> #include <sys/types.h>
> #include <stdlib.h>
> #include <string.h>
> #include <unistd.h>
> #include <sys/syscall.h>
>
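> /* Syscall numbers from this patch set's syscall_64.tbl; they are only
>  * valid on a kernel with these patches applied. */
> #define SYS_mvolatile 313
> #define SYS_mnovolatile 314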
>
> #define ALLOC_SIZE (8 << 20)
> #define MAP_SIZE (ALLOC_SIZE * 10)
> #define PAGE_SIZE (1 << 12)
> #define RETRY 100
>
> pthread_barrier_t barrier;
> int mode;
> #define VOLATILE_MODE 1
>
> static int mvolatile(void *addr, size_t length)
> {
> 	return syscall(SYS_mvolatile, addr, length);
> }
>
> static int mnovolatile(void *addr, size_t length)
> {
> 	return syscall(SYS_mnovolatile, addr, length);
> }
>
> void *thread_entry(void *data)
> {
> 	unsigned long i;
> 	cpu_set_t set;
> 	int cpu = *(int *)data;
> 	char *mmap_area;	/* char * so that pointer arithmetic is well defined */
> 	int retry = RETRY;
>
> 	/* Pin this thread to its own CPU. */
> 	CPU_ZERO(&set);
> 	CPU_SET(cpu, &set);
> 	sched_setaffinity(0, sizeof(set), &set);
>
> 	/* PROT_NONE page mapped before the test area (apparently to keep
> 	 * the per-thread areas from merging into one VMA). */
> 	mmap(NULL, PAGE_SIZE, PROT_NONE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
> 	mmap_area = mmap(NULL, MAP_SIZE, PROT_READ|PROT_WRITE,
> 			MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
> 	if (mmap_area == MAP_FAILED) {
> 		fprintf(stderr, "Fail to mmap [%d]\n", cpu);
> 		exit(1);
> 	}
>
> 	pthread_barrier_wait(&barrier);
>
> 	while (retry--) {
> 		if (mode == VOLATILE_MODE) {
> 			mvolatile(mmap_area, MAP_SIZE);
> 			for (i = 0; i < MAP_SIZE; i += ALLOC_SIZE) {
> 				mnovolatile(mmap_area + i, ALLOC_SIZE);
> 				memset(mmap_area + i, i, ALLOC_SIZE);
> 				mvolatile(mmap_area + i, ALLOC_SIZE);
> 			}
> 		} else {
> 			for (i = 0; i < MAP_SIZE; i += ALLOC_SIZE) {
> 				memset(mmap_area + i, i, ALLOC_SIZE);
> 				madvise(mmap_area + i, ALLOC_SIZE, MADV_DONTNEED);
> 			}
> 		}
> 	}
> 	return NULL;
> }
>
> int main(int argc, char *argv[])
> {
> 	int i, nr_thread, err;
> 	int *data;
> 	pthread_t *thread;
>
> 	/* usage: <program> <nr_thread> <mode>; mode 1 selects VOLATILE_MODE */
> 	if (argc < 3)
> 		return 1;
>
> 	nr_thread = atoi(argv[1]);
> 	mode = atoi(argv[2]);
> 	if (nr_thread <= 0)
> 		return 1;
>
> 	thread = malloc(sizeof(pthread_t) * nr_thread);
> 	data = malloc(sizeof(int) * nr_thread);
> 	if (!thread || !data)
> 		return 1;
> 	pthread_barrier_init(&barrier, NULL, nr_thread);
>
> 	for (i = 0; i < nr_thread; i++) {
> 		data[i] = i;
> 		/* pthread functions return the error number; errno is not set */
> 		err = pthread_create(&thread[i], NULL, thread_entry, &data[i]);
> 		if (err) {
> 			fprintf(stderr, "Fail to create thread: %s\n", strerror(err));
> 			exit(1);
> 		}
> 	}
>
> 	for (i = 0; i < nr_thread; i++) {
> 		err = pthread_join(thread[i], NULL);
> 		if (err)
> 			fprintf(stderr, "Fail to join thread: %s\n", strerror(err));
> 		printf("[%d] thread done\n", i);
> 	}
>
> 	return 0;
> }
>

