Re: [PATCH v7 00/12] KVM: Add host swap event notifications for PV guest

From: Gleb Natapov
Date: Thu Oct 14 2010 - 05:22:13 EST


Ignore this please. Something bad happened to the From: header.

On Thu, Oct 14, 2010 at 11:16:58AM +0200, y@xxxxxxxxxx wrote:
> From: Gleb Natapov <gleb@xxxxxxxxxx>
>
> KVM virtualizes guest memory by means of shadow pages or HW assistance
> like NPT/EPT. Not all memory used by a guest is mapped into the guest
> address space or even present in host memory at any given time.
> When a vcpu tries to access a memory page that is not mapped into the
> guest address space, KVM is notified about it. KVM maps the page into the
> guest address space and resumes vcpu execution. If the page is swapped
> out from host memory, vcpu execution is suspended until the page is
> swapped in again. This is inefficient since the vcpu could do other work
> (run another task or serve interrupts) while the page gets swapped in.
>
> The patch series tries to mitigate this problem by introducing two
> mechanisms. The first one is used with non-PV guests and works like
> this: when a vcpu tries to access a swapped out page it is halted and
> the requested page is swapped in by another thread. That way the vcpu can
> still process interrupts while IO is happening in parallel and, with any
> luck, an interrupt will cause the guest to schedule another task on the
> vcpu, so it will have work to do instead of waiting for the page to be
> swapped in.
>
> The second mechanism introduces PV notification about swapped page state
> to a guest (asynchronous page fault). Instead of halting the vcpu upon
> access to a swapped out page and hoping that some interrupt will cause a
> reschedule, we immediately inject an asynchronous page fault into the
> vcpu. A PV aware guest knows that upon receiving such an exception it
> should schedule another task to run on the vcpu. The current task is put
> to sleep until another kind of asynchronous page fault is received that
> notifies the guest that the page is now in host memory, so the task that
> waits for it can run again.
>
> To measure performance benefits I use a simple benchmark program (below)
> that starts a number of threads. Some of them do work (increment a
> counter), others access a huge array at random locations trying to
> generate host page faults. The size of the array is smaller than guest
> memory but bigger than host memory, so we are guaranteed that the host
> will swap out part of the array.
>
> I ran the benchmark on three setups: with current kvm.git (master),
> with my patch series + non-pv guest (nonpv) and with my patch series +
> pv guest (pv).
>
> Each guest had 4 cpus and 2G memory and was launched inside 512M memory
> container. The command line was "./bm -f 4 -w 4 -t 60" (run 4 faulting
> threads and 4 working threads for a minute).
>
> Below is the total amount of "work" each guest managed to do
> (average of 10 runs):
>              total work      std error
> master:    122789420615   (3818565029)
> nonpv:     138455939001    (773774299)
> pv:        234351846135  (10461117116)
>
> Changes:
> v1->v2
>   Use MSR instead of hypercall.
>   Move most of the code into arch independent place.
>   Halt inside a guest instead of doing "wait for page" hypercall if
>     preemption is disabled.
> v2->v3
>   Use MSR from range 0x4b564dxx.
>   Add slot version tracking.
>   Support migration by restarting all guest processes after migration.
>   Drop patch that tracked preemptability for non-preemptable kernels
>     due to performance concerns. Send async PF to non-preemptable
>     guests only when vcpu is executing userspace code.
> v3->v4
>   Provide alternative page fault handler in PV guest instead of adding
>     hook to standard page fault handler and patch it out on non-PV guests.
>   Allow only limited number of outstanding async page faults per vcpu.
>   Unify gfn_to_pfn and gfn_to_pfn_async code.
>   Cancel outstanding slow work on reset.
> v4->v5
>   Move async pv cpu initialization into cpu hotplug notifier.
>   Use GFP_NOWAIT instead of GFP_ATOMIC for allocation that shouldn't sleep.
>   Process KVM_REQ_MMU_SYNC even in page_fault_other_cr3() before changing
>     cr3 back.
> v5->v6
>   Too many. Will list only major changes here.
>   Replace slow work with work queues.
>   Halt vcpu for non-pv guests.
>   Handle async PF in nested SVM mode.
>   Do not prefault swapped in page for non tdp case.
> v6->v7
>   Fix "GUP fail in work thread" problem.
>   Do prefault only if mmu is in direct map mode.
>   Use cpu->request to ask for vcpu halt (drop optimization that tried to
>     skip non-present apf injection if page is swapped in before next
>     vmentry).
>   Keep track of synthetic halt in separate state to prevent it from
>     leaking during migration.
>   Fix memslot tracking problems.
>   More documentation.
>   Other small comments are addressed.
>
> Gleb Natapov (12):
> Add get_user_pages() variant that fails if major fault is required.
> Halt vcpu if page it tries to access is swapped out.
> Retry fault before vmentry
> Add memory slot versioning and use it to provide fast guest write interface
> Move kvm_smp_prepare_boot_cpu() from kvmclock.c to kvm.c.
> Add PV MSR to enable asynchronous page faults delivery.
> Add async PF initialization to PV guest.
> Handle async PF in a guest.
> Inject asynchronous page fault into a PV guest if page is swapped out.
> Handle async PF in non preemptable context
> Let host know whether the guest can handle async PF in non-userspace context.
> Send async PF when guest is not in userspace too.
>
> Documentation/kernel-parameters.txt | 3 +
> Documentation/kvm/cpuid.txt | 3 +
> Documentation/kvm/msr.txt | 36 ++++-
> arch/x86/include/asm/kvm_host.h | 28 +++-
> arch/x86/include/asm/kvm_para.h | 24 +++
> arch/x86/include/asm/traps.h | 1 +
> arch/x86/kernel/entry_32.S | 10 +
> arch/x86/kernel/entry_64.S | 3 +
> arch/x86/kernel/kvm.c | 315 +++++++++++++++++++++++++++++++++++
> arch/x86/kernel/kvmclock.c | 13 +--
> arch/x86/kvm/Kconfig | 1 +
> arch/x86/kvm/Makefile | 1 +
> arch/x86/kvm/mmu.c | 61 ++++++-
> arch/x86/kvm/paging_tmpl.h | 8 +-
> arch/x86/kvm/svm.c | 45 ++++-
> arch/x86/kvm/x86.c | 192 +++++++++++++++++++++-
> fs/ncpfs/mmap.c | 2 +
> include/linux/kvm.h | 1 +
> include/linux/kvm_host.h | 39 +++++
> include/linux/kvm_types.h | 7 +
> include/linux/mm.h | 5 +
> include/trace/events/kvm.h | 95 +++++++++++
> mm/filemap.c | 3 +
> mm/memory.c | 31 +++-
> mm/shmem.c | 8 +-
> virt/kvm/Kconfig | 3 +
> virt/kvm/async_pf.c | 213 +++++++++++++++++++++++
> virt/kvm/async_pf.h | 36 ++++
> virt/kvm/kvm_main.c | 132 ++++++++++++---
> 29 files changed, 1255 insertions(+), 64 deletions(-)
> create mode 100644 virt/kvm/async_pf.c
> create mode 100644 virt/kvm/async_pf.h
>
> === benchmark.c ===
>
> #include <stdlib.h>
> #include <stdio.h>
> #include <string.h>
> #include <unistd.h>
> #include <pthread.h>
>
> #define FAULTING_THREADS 1
> #define WORKING_THREADS 1
> #define TIMEOUT 5
> #define MEMORY (1024UL*1024*1024)
>
> pthread_barrier_t barrier;
> volatile int stop;
> size_t pages;
>
> void *fault_thread(void *p)
> {
>     char *mem = p;
>
>     pthread_barrier_wait(&barrier);
>
>     /* touch one random page (4K) per iteration */
>     while (!stop)
>         mem[(random() % pages) << 12] = 10;
>
>     pthread_barrier_wait(&barrier);
>
>     return NULL;
> }
>
> void *work_thread(void *p)
> {
>     unsigned long *i = p;
>
>     pthread_barrier_wait(&barrier);
>
>     /* "work" is just incrementing this thread's counter */
>     while (!stop)
>         (*i)++;
>
>     pthread_barrier_wait(&barrier);
>
>     return NULL;
> }
>
> int main(int argc, char **argv)
> {
>     int ft = FAULTING_THREADS, wt = WORKING_THREADS;
>     unsigned int timeout = TIMEOUT;
>     size_t mem = MEMORY;
>     void *buf;
>     int i, opt, verbose = 0;
>     pthread_t t;
>     pthread_attr_t pattr;
>     unsigned long *res, sum = 0;
>
>     while ((opt = getopt(argc, argv, "f:w:m:t:v")) != -1) {
>         switch (opt) {
>         case 'f':
>             ft = atoi(optarg);
>             break;
>         case 'w':
>             wt = atoi(optarg);
>             break;
>         case 'm':
>             /* strtoul: atoi() cannot represent sizes of 2G and above */
>             mem = strtoul(optarg, NULL, 0);
>             break;
>         case 't':
>             timeout = atoi(optarg);
>             break;
>         case 'v':
>             verbose++;
>             break;
>         default:
>             fprintf(stderr, "Usage: %s [-f num] [-w num] [-m byte] [-t secs] [-v]\n", argv[0]);
>             exit(1);
>         }
>     }
>
>     if (verbose)
>         printf("fault=%d work=%d mem=%zu timeout=%u\n", ft, wt, mem, timeout);
>
>     pages = mem >> 12;
>     if (posix_memalign(&buf, 4096, pages << 12)) {
>         fprintf(stderr, "cannot allocate %zu bytes\n", pages << 12);
>         exit(1);
>     }
>     /* one zeroed counter per working thread */
>     res = calloc(wt, sizeof(unsigned long));
>     if (!res) {
>         fprintf(stderr, "cannot allocate counters\n");
>         exit(1);
>     }
>
>     pthread_attr_init(&pattr);
>     pthread_barrier_init(&barrier, NULL, ft + wt + 1);
>
>     for (i = 0; i < ft; i++) {
>         pthread_create(&t, &pattr, fault_thread, buf);
>         pthread_detach(t);
>     }
>
>     for (i = 0; i < wt; i++) {
>         pthread_create(&t, &pattr, work_thread, &res[i]);
>         pthread_detach(t);
>     }
>
>     /* prefault memory */
>     memset(buf, 0, pages << 12);
>     printf("start\n");
>
>     /* release all threads at once */
>     pthread_barrier_wait(&barrier);
>
>     pthread_barrier_destroy(&barrier);
>     pthread_barrier_init(&barrier, NULL, ft + wt + 1);
>
>     sleep(timeout);
>     stop = 1;
>
>     /* wait until every thread has left its loop */
>     pthread_barrier_wait(&barrier);
>
>     for (i = 0; i < wt; i++) {
>         sum += res[i];
>         printf("worker %d: %lu\n", i, res[i]);
>     }
>     printf("total: %lu\n", sum);
>
>     return 0;
> }
>

--
Gleb.