Re: [PATCH] kvm: x86: disable KVM_FAST_MMIO_BUS

From: Michael S. Tsirkin
Date: Wed Aug 16 2017 - 10:06:59 EST


On Wed, Aug 16, 2017 at 03:37:47PM +0200, Paolo Bonzini wrote:
> On 16/08/2017 14:07, Radim Krčmář wrote:
> > 2017-08-16 13:22+0200, Paolo Bonzini:
> >> Microsoft pointed out privately to me that KVM's handling of
> >> KVM_FAST_MMIO_BUS is invalid. Using skip_emulated_instruction is invalid
> >> in EPT misconfiguration vmexit handlers, because neither EPT violations
> >> nor misconfigurations are listed in the manual among the VM exits that
> >> set the VM-exit instruction length field.
> >>
> >> While physical processors seem to set the field, this is not architectural
> >> and is just a side effect of the implementation. I couldn't convince
> >> myself of any condition on the exit qualification where VM-exit
> >> instruction length "has" to be defined; there are no trap-like VM-exits
> >> that can be repurposed; and fault-like VM-exits such as descriptor-table
> >> exits provide no decoding information. So I don't really see any elegant
> >> way to fix it except by disabling KVM_FAST_MMIO_BUS, which means virtio
> >> 1 will go slower.
> >
> > Do you have some numbers?
>
> Raw numbers from vmexit.flat on Haswell-EP:
>
> mmio-no-eventfd:pci-mem 5793
> mmio-wildcard-eventfd:pci-mem 1395
> mmio-datamatch-eventfd:pci-mem 2268
>
> So roughly 900 clock cycles. Most of the work is the four memory reads
> done by x86_decode_insn, three to walk the page tables and one to fetch
> the instruction.
>
> > We could keep the ugliness in KVM and add a new skip function with
> > emulate_instruction(vcpu, EMULTYPE_SKIP) to decode the length of the
> > instruction. (Adding a condition just for EPT violation exit reason to
> > the existing skip function would be a dirtier solution.)
> > Slower than what we have now, but faster than full emulation.
>
> This is actually a good idea, and not ugly at all! The main cost is
> translating the physical address of the instruction and fetching the
> bytes, so only 200 clock cycles are saved.

We actually know what to expect (a write), so we could maybe
optimize this further with a dedicated function just for this case.

>
> However, the eventfd is written before decoding, while full emulation
> would write it after. So while VCPU thread latency is worse compared to
> skip_emulated_instruction, latency to the iothread remains small.
>
> Paolo