Re: [RFC V2 0/9] x86/mmu:Introduce parallel memory virtualization to boost performance

From: Ben Gardon
Date: Thu Sep 24 2020 - 13:14:19 EST


On Wed, Sep 23, 2020 at 11:28 PM Wanpeng Li <kernellwp@xxxxxxxxx> wrote:
>
> Any comments? Paolo! :)

Hi, sorry to be so late in replying! I wanted to post the first part
of the TDP MMU series I've been working on before responding so we
could discuss the two together, but I haven't been able to get it out
as fast as I would have liked. (I'll send it ASAP!) I'm hopeful that
it will ultimately help address some of the page fault handling and
lock contention issues you're addressing with these patches. I'd also
be happy to work together to add a prepopulation feature to it. I'll
put in some more comments inline below.

> On Wed, 9 Sep 2020 at 11:04, Wanpeng Li <kernellwp@xxxxxxxxx> wrote:
> >
> > Any comments? guys!
> > On Tue, 1 Sep 2020 at 19:52, <yulei.kernel@xxxxxxxxx> wrote:
> > >
> > > From: Yulei Zhang <yulei.kernel@xxxxxxxxx>
> > >
> > > > Currently in KVM memory virtualization we rely on mmu_lock to
> > > > synchronize memory mapping updates, which makes vCPUs work
> > > > in a serialized mode and slows down execution. Especially after
> > > > migration, rebuilding a substantial number of memory mappings
> > > > causes a visible performance drop, and it gets worse as the guest
> > > > has more vCPUs and more memory.
> > >
> > > The idea we present in this patch set is to mitigate the issue
> > > with pre-constructed memory mapping table. We will fast pin the
> > > guest memory to build up a global memory mapping table according
> > > to the guest memslots changes and apply it to cr3, so that after
> > > guest starts up all the vCPUs would be able to update the memory
> > > simultaneously without page fault exception, thus the performance
> > > improvement is expected.

My understanding from this RFC is that your primary goal is to
eliminate page fault latencies and lock contention arising from the
first page faults incurred by vCPUs when initially populating the EPT.
Is that right?

I have the impression that the pinning and generally static memory
mappings are more a convenient simplification than part of a larger
goal to avoid incurring page faults down the line. Is that correct?

I ask because, from our conversation on v1 of this RFC, I didn't fully
understand why reimplementing the page fault handler and associated
functions was necessary for the above goals, as I understood them.
My impression of the prepopulation approach is that KVM will
sequentially populate all the EPT entries to map guest memory. I
understand how this could be optimized to be quite efficient, but I
don't understand how it would scale better than the existing
implementation with one vCPU accessing memory.

> > >
> > > > We use a memory-dirtying workload to test the initial patch
> > > > set and get a positive result even with huge pages enabled. For
> > > > example, we create a guest with 32 vCPUs and 64G of memory, and let
> > > > the vCPUs dirty the entire memory region concurrently. As the
> > > > initial patch eliminates the overhead of mmu_lock, in 2M/1G huge
> > > > page mode the job completes about 50% faster.

In this benchmark, did you include the time required to pre-populate
the EPT, or just the time required for the vCPUs to dirty memory?
I ask because I'm curious whether your priority is to decrease the
total end-to-end time, or whether you mainly care about the guest
experience and not so much the VM startup time.
How does this compare to the case where 1 vCPU reads every page of
memory and then 32 vCPUs concurrently dirty every page?
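
To make the comparison I'm asking about concrete, here is a rough
user-space sketch of the two-phase measurement, meant to run inside the
guest. It is only an illustration, not the workload used in the RFC.
The names read_every_page(), dirty_concurrently() and MAX_THREADS are
made up for this example, and whether the read pass actually populates
EPT entries depends on how the guest and host back the buffer, so
treat phase 1 as an assumption to verify rather than a guarantee.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <time.h>

#define MAX_THREADS 64 /* hypothetical cap, chosen for this sketch */

struct dirty_job {
	volatile unsigned char *base; /* start of this thread's slice */
	size_t pages;                 /* number of pages in the slice */
	size_t page_size;
};

static double now_sec(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return (double)ts.tv_sec + (double)ts.tv_nsec / 1e9;
}

/* Phase 1: one thread reads a byte from every page. Returns pages read. */
static size_t read_every_page(volatile unsigned char *buf, size_t pages,
			      size_t page_size)
{
	unsigned char sink = 0;

	for (size_t i = 0; i < pages; i++)
		sink ^= buf[i * page_size];
	(void)sink;
	return pages;
}

static void *dirty_slice(void *arg)
{
	struct dirty_job *job = arg;

	for (size_t i = 0; i < job->pages; i++)
		job->base[i * job->page_size] = 1;
	return NULL;
}

/*
 * Phase 2: split the buffer into nthreads slices and write one byte per
 * page from each thread. Returns elapsed seconds, or -1.0 on error.
 */
static double dirty_concurrently(unsigned char *buf, size_t pages,
				 size_t page_size, unsigned int nthreads)
{
	pthread_t tids[MAX_THREADS];
	struct dirty_job jobs[MAX_THREADS];
	unsigned int started = 0;
	size_t done = 0;
	double start;

	assert(page_size > 0);
	if (nthreads == 0 || nthreads > MAX_THREADS)
		return -1.0;

	start = now_sec();
	for (unsigned int t = 0; t < nthreads; t++) {
		/* Spread the remainder over the first few threads. */
		size_t n = pages / nthreads + (t < pages % nthreads ? 1 : 0);

		jobs[t].base = buf + done * page_size;
		jobs[t].pages = n;
		jobs[t].page_size = page_size;
		done += n;
		if (pthread_create(&tids[t], NULL, dirty_slice, &jobs[t]) != 0)
			break;
		started++;
	}
	for (unsigned int t = 0; t < started; t++)
		pthread_join(tids[t], NULL);

	return started == nthreads ? now_sec() - start : -1.0;
}
```

The idea would be to time dirty_concurrently() twice on fresh buffers,
once directly and once after read_every_page(), and compare both against
the pre-populated case. Reporting the pre-population time separately
would show whether the gain is in end-to-end time or only in what the
guest sees.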

> > >
> > > We only validate this feature on Intel x86 platform. And as Ben
> > > pointed out in RFC V1, so far we disable the SMM for resource
> > > consideration, drop the mmu notification as in this case the
> > > memory is pinned.

I'm excited to see big MMU changes like this, and I look forward to
combining our needs towards a better MMU for the x86 TDP case. Have
you thought about how you would build SMM and MMU notifier support
onto this patch series? I know that the invalidate range notifiers, at
least, added a lot of non-trivial complexity to the direct MMU
implementation I presented last year.

> > >
> > > V1->V2:
> > > * Rebase the code to kernel version 5.9.0-rc1.
> > >
> > > Yulei Zhang (9):
> > > >   Introduce new fields in kvm_arch/vcpu_arch struct for direct build EPT
> > > >     support
> > > >   Introduce page table population function for direct build EPT feature
> > > >   Introduce page table remove function for direct build EPT feature
> > > >   Add release function for direct build ept when guest VM exit
> > > >   Modify the page fault path to meet the direct build EPT requirement
> > > >   Apply the direct build EPT according to the memory slots change
> > > >   Add migration support when using direct build EPT
> > > >   Introduce kvm module parameter global_tdp to turn on the direct build
> > > >     EPT mode
> > > >   Handle certain mmu exposed functions properly while turn on direct
> > > >     build EPT mode
> > >
> > > >  arch/mips/kvm/mips.c            |  13 +
> > > >  arch/powerpc/kvm/powerpc.c      |  13 +
> > > >  arch/s390/kvm/kvm-s390.c        |  13 +
> > > >  arch/x86/include/asm/kvm_host.h |  13 +-
> > > >  arch/x86/kvm/mmu/mmu.c          | 533 ++++++++++++++++++++++++++++++--
> > > >  arch/x86/kvm/svm/svm.c          |   2 +-
> > > >  arch/x86/kvm/vmx/vmx.c          |   7 +-
> > > >  arch/x86/kvm/x86.c              |  55 ++--
> > > >  include/linux/kvm_host.h        |   7 +-
> > > >  virt/kvm/kvm_main.c             |  43 ++-
> > > >  10 files changed, 639 insertions(+), 60 deletions(-)
> > >
> > > --
> > > 2.17.1
> > >