Re: [announce] [patch] KVM paravirtualization for Linux

From: Ingo Molnar
Date: Sun Jan 07 2007 - 12:49:24 EST



* Avi Kivity <avi@xxxxxxxxxxxx> wrote:

> >2-task context-switch performance (in microseconds, lower is better):
> >
> > native: 1.11
> > ----------------------------------
> > Qemu: 61.18
> > KVM upstream: 53.01
> > KVM trunk: 6.36
> > KVM trunk+paravirt/cr3: 1.60
> >
> >i.e. 2-task context-switch performance is faster by a factor of 4, and
> >is now quite close to native speed!
> >
>
> Very impressive! The gain probably comes not only from avoiding the
> vmentry/vmexit, but also from avoiding the flushing of the global page
> tlb entries.

90% of the win comes from the avoidance of the VM exit. To quantify this
more precisely i have added an artificial __flush_tlb_global() call right
after switch_to(), just to see how much impact an extra global flush has
on the native kernel. Context-switch cost went from 1.11 usecs to 1.65
usecs. Then i added a __flush_tlb(), which made the cost go to 1.75,
which means that the global flush component is at around 0.5 usecs.
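
as a rough sketch of that arithmetic (the struct and function names are
made up for illustration, and it assumes the flush cost measured on the
native kernel carries over to the guest unchanged):

```c
#include <assert.h>

/* Context-switch costs in microseconds, copied from this thread. */
struct ctxsw_costs {
	double native;              /* 1.11 */
	double native_global_flush; /* 1.65, native + __flush_tlb_global() */
	double kvm_trunk;           /* 6.36 */
	double kvm_paravirt_cr3;    /* 1.60 */
};

/* Extra cost of one global TLB flush per context switch. */
static inline double global_flush_cost(const struct ctxsw_costs *c)
{
	return c->native_global_flush - c->native;
}

/* Total gain of the paravirt/cr3 run over the non-paravirt run. */
static inline double paravirt_win(const struct ctxsw_costs *c)
{
	return c->kvm_trunk - c->kvm_paravirt_cr3;
}

/* Fraction of the gain that the avoided global flush could explain. */
static inline double flush_share_of_win(const struct ctxsw_costs *c)
{
	assert(paravirt_win(c) > 0.0);
	return global_flush_cost(c) / paravirt_win(c);
}
```

with the numbers above this gives a flush cost of about 0.54 usecs out
of a 4.76 usecs win, i.e. roughly a tenth, which leaves about 90% for
the avoided VM exit.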

> >"hackbench 1" (utilizes 40 tasks, numbers in seconds, lower is better):
> >
> > native: 0.25
> > ----------------------------------
> > Qemu: 7.8
> > KVM upstream: 2.8
> > KVM trunk: 0.55
> > KVM paravirt/cr3: 0.36
> >
> >almost twice as fast.
> >
> >"hackbench 5" (utilizes 200 tasks, numbers in seconds, lower is better):
> >
> > native: 0.9
> > ----------------------------------
> > Qemu: 35.2
> > KVM upstream: 9.4
> > KVM trunk: 2.8
> > KVM paravirt/cr3: 2.2
> >
> >still a 30% improvement - which isnt too bad considering that 200 tasks
> >are context-switching in this workload and the cr3 cache in current CPUs
> >is only 4 entries.
> >
>
> This is a little too good to be true. Were both runs with the same
> KVM_NUM_MMU_PAGES?

yes, both had the same elevated KVM_NUM_MMU_PAGES of 2048. The 'trunk'
run should have been labeled as: 'cr3 tree with paravirt turned off'.
That's not completely 'trunk' but close to it, and all other changes
(like elimination of unnecessary TLB flushes) are fairly applied to
both.

i also did a run with far fewer MMU cache pages (256), and hackbench 1
stayed the same, while hackbench 5 numbers started fluctuating badly (i
think that workload is thrashing the MMU cache badly).

> I'm also concerned that at this point in time the cr3 optimizations
> will only show an improvement in microbenchmarks. In real life
> workloads a context switch is usually preceded by an I/O, and with the
> current sorry state of kvm I/O the context switch time would be
> dominated by the I/O time.

oh, i agree completely - but in my opinion accelerating virtual I/O is
really easy. Accelerating the context-switch path (and basic syscall
overhead, as KVM does) is /hard/. So i wanted to see whether KVM runs
well in all the hard cases, before looking at the low-hanging
performance fruit in the I/O area =B-)

also note that there are lots of internal reasons why application
workloads can be heavily context-switching - it's not just I/O that
generates them. (pipes, critical sections / futexes, etc.) So having
near-native performance for context-switches is very important.
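
to illustrate a context-switch-heavy pattern with no device I/O at all,
here is a minimal pipe ping-pong sketch. It is not the benchmark used
for the numbers above, the function name is made up, and it does not pin
the two tasks to one CPU, so on an SMP box it may measure cross-CPU
wakeups rather than pure context switches:

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <assert.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

/*
 * Bounce one byte between a parent and a child over two pipes.
 * Every hand-off blocks one task and wakes the other.
 * Returns the average microseconds per one-way hand-off, or -1.0 on error.
 */
static inline double pipe_pingpong_usecs(long rounds)
{
	int to_child[2], to_parent[2];
	struct timespec t0, t1;
	char byte = 'x';
	pid_t pid;
	long i;

	assert(rounds > 0);
	if (pipe(to_child) < 0)
		return -1.0;
	if (pipe(to_parent) < 0) {
		close(to_child[0]);
		close(to_child[1]);
		return -1.0;
	}

	pid = fork();
	if (pid < 0) {
		close(to_child[0]);
		close(to_child[1]);
		close(to_parent[0]);
		close(to_parent[1]);
		return -1.0;
	}
	if (pid == 0) {
		/* child: echo each byte back until the parent closes its end */
		close(to_child[1]);
		close(to_parent[0]);
		while (read(to_child[0], &byte, 1) == 1) {
			if (write(to_parent[1], &byte, 1) != 1)
				break;
		}
		_exit(0);
	}

	/* parent: keep only the ends it uses */
	close(to_child[0]);
	close(to_parent[1]);

	clock_gettime(CLOCK_MONOTONIC, &t0);
	for (i = 0; i < rounds; i++) {
		if (write(to_child[1], &byte, 1) != 1 ||
		    read(to_parent[0], &byte, 1) != 1)
			break;
	}
	clock_gettime(CLOCK_MONOTONIC, &t1);

	close(to_child[1]);	/* EOF ends the child's read loop */
	close(to_parent[0]);
	waitpid(pid, NULL, 0);

	if (i != rounds)
		return -1.0;
	/* each round is two hand-offs: parent->child and child->parent */
	return ((t1.tv_sec - t0.tv_sec) * 1e6 +
		(t1.tv_nsec - t0.tv_nsec) / 1e3) / (2.0 * rounds);
}
```

calling pipe_pingpong_usecs(100000) on native and inside a guest gives a
crude way to compare the two; the absolute values will differ from the
table above.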

> >+ if (irq & 8) {
> >+ outb(cached_slave_mask, PIC_SLAVE_IMR);
> >+ outb(0x60+(irq&7),PIC_SLAVE_CMD);/* 'Specific EOI' to slave */
> >+ outb(0x60+PIC_CASCADE_IR,PIC_MASTER_CMD); /* 'Specific EOI' to master-IRQ2 */
> >+ } else {
> >+ outb(cached_master_mask, PIC_MASTER_IMR);
> >+ /* 'Specific EOI' to master: */
> >+ outb(0x60+irq, PIC_MASTER_CMD);
> >+ }
> >+ spin_unlock_irqrestore(&i8259A_lock, flags);
> >+}
>
> Any reason this can't be applied to mainline? There's probably no
> downside to native, and it would benefit all virtualization solutions
> equally.

this is legacy stuff ...

> >- u64 *pae_root;
> >+ u64 *pae_root[KVM_CR3_CACHE_SIZE];
>
> hmm. wouldn't it be simpler to have pae_root always point at the
> current root?

does that guarantee that it's available? I wanted to 'pin' the root
itself this way, to make sure that if a guest switches to it via the
cache, that it's truly available and a valid root. cr3 addresses are
non-virtual so this is the only mechanism available to guarantee that
the host-side memory truly contains a root pagetable.
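
a toy userspace model of the idea, not the code from the patch: every
name below is hypothetical, the 4 slots just mirror the 4-entry figure
mentioned earlier in this thread, the round-robin replacement is an
assumption, and host_root stands in for a pinned pae_root[j] page that
stays allocated for as long as the cache exists:

```c
#include <assert.h>
#include <stdint.h>

#define CR3_CACHE_SLOTS 4		/* hypothetical, see lead-in */
#define INVALID_ROOT    UINT64_MAX	/* marks an unused slot / a miss */

struct cr3_cache_entry {
	uint64_t guest_cr3;	/* value the guest wants to load */
	uint64_t host_root;	/* pinned shadow root backing it */
};

struct cr3_cache {
	struct cr3_cache_entry entry[CR3_CACHE_SLOTS];
	unsigned int next;	/* round-robin replacement cursor */
};

static inline void cr3_cache_init(struct cr3_cache *c)
{
	unsigned int j;

	for (j = 0; j < CR3_CACHE_SLOTS; j++) {
		c->entry[j].guest_cr3 = INVALID_ROOT;
		c->entry[j].host_root = INVALID_ROOT;
	}
	c->next = 0;
}

/* Hit: return the pinned root. Miss: return INVALID_ROOT (slow path). */
static inline uint64_t cr3_cache_lookup(const struct cr3_cache *c,
					uint64_t guest_cr3)
{
	unsigned int j;

	assert(guest_cr3 != INVALID_ROOT);
	for (j = 0; j < CR3_CACHE_SLOTS; j++) {
		if (c->entry[j].guest_cr3 == guest_cr3)
			return c->entry[j].host_root;
	}
	return INVALID_ROOT;
}

/* Record a mapping, updating in place or evicting the oldest slot. */
static inline void cr3_cache_insert(struct cr3_cache *c,
				    uint64_t guest_cr3, uint64_t host_root)
{
	unsigned int j;

	assert(guest_cr3 != INVALID_ROOT);
	for (j = 0; j < CR3_CACHE_SLOTS; j++) {
		if (c->entry[j].guest_cr3 == guest_cr3) {
			c->entry[j].host_root = host_root;
			return;
		}
	}
	c->entry[c->next].guest_cr3 = guest_cr3;
	c->entry[c->next].host_root = host_root;
	c->next = (c->next + 1) % CR3_CACHE_SLOTS;
}
```

the point of the model is only that a hit hands back a root that is
known to exist, because the slot owns it; a miss has to take the slow
path.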

> >+ vcpu->mmu.pae_root[j][i] = INVALID_PAGE;
> >+ }
> > }
> > vcpu->mmu.root_hpa = INVALID_PAGE;
> > }
>
> You keep the page directories pinned here. [...]

yes.

> [...] This can be a problem if a guest frees a page directory, and
> then starts using it as a regular page. kvm sometimes chooses not to
> emulate a write to a guest page table, but instead to zap it, which is
> impossible when the page is freed. You need to either unpin the page
> when that happens, or add a hypercall to let kvm know when a page
> directory is freed.

the cache is zapped upon pagefaults anyway, so unpinning ought to be
possible. Which one would you prefer?

> >- for (i = 0; i < 4; ++i)
> >- vcpu->mmu.pae_root[i] = INVALID_PAGE;
> >+ for (j = 0; j < KVM_CR3_CACHE_SIZE; j++) {
> >+ /*
> >+ * When emulating 32-bit mode, cr3 is only 32 bits even on
> >+ * x86_64. Therefore we need to allocate shadow page tables
> >+ * in the first 4GB of memory, which happens to fit the DMA32
> >+ * zone:
> >+ */
> >+ page = alloc_page(GFP_KERNEL | __GFP_DMA32);
> >+ if (!page)
> >+ goto error_1;
> >+
> >+ ASSERT(!vcpu->mmu.pae_root[j]);
> >+ vcpu->mmu.pae_root[j] = page_address(page);
> >+ for (i = 0; i < 4; ++i)
> >+ vcpu->mmu.pae_root[j][i] = INVALID_PAGE;
> >+ }
>
> Since a pae root uses just 32 bytes, you can store all cache entries
> in a single page. Not that it matters much.
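
a userspace sketch of that packing, under stated assumptions: 4 KB
pages, a hypothetical slot count of 4, made-up names throughout, and
aligned_alloc() standing in for the kernel's page allocator. A PAE root
is 4 entries of 8 bytes, so each one is 32 bytes and a page-aligned
array of them keeps the 32-byte alignment a PAE root needs:

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define SKETCH_PAGE_BYTES   4096u
#define PAE_ROOT_ENTRIES    4u
#define SKETCH_CACHE_SLOTS  4u			/* hypothetical cache size */
#define SKETCH_INVALID_PAGE (~(uint64_t)0)	/* stand-in for INVALID_PAGE */

/* One PAE root: 4 x 8 bytes = 32 bytes. */
typedef uint64_t pae_root_t[PAE_ROOT_ENTRIES];

/* How many PAE roots fit into a single page. */
static inline size_t pae_roots_per_page(void)
{
	return SKETCH_PAGE_BYTES / sizeof(pae_root_t);
}

/*
 * Allocate one page and carve all cache slots out of it.
 * Returns NULL on allocation failure; release with free().
 */
static inline pae_root_t *alloc_packed_pae_roots(void)
{
	pae_root_t *roots;
	unsigned int i, j;

	assert(SKETCH_CACHE_SLOTS <= pae_roots_per_page());

	roots = aligned_alloc(SKETCH_PAGE_BYTES, SKETCH_PAGE_BYTES);
	if (!roots)
		return NULL;

	for (j = 0; j < SKETCH_CACHE_SLOTS; j++) {
		for (i = 0; i < PAE_ROOT_ENTRIES; i++)
			roots[j][i] = SKETCH_INVALID_PAGE;
	}
	return roots;
}
```

with these sizes 128 roots fit in one page, so a small cache uses only a
fraction of it.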

yeah - i wanted to extend the current code in a safe way, before
optimizing it.

> >+#define KVM_API_MAGIC 0x87654321
> >+
>
> <linux/kvm.h> is the vmm userspace interface. The guest/host
> interface should probably go somewhere else.

yeah. kvm_para.h?

Ingo