Re: [PATCH 5/8] KVM: SVM: Re-inject INT3/INTO instead of retrying the instruction

From: Maciej S. Szmigiero
Date: Thu Apr 07 2022 - 11:33:04 EST


On 7.04.2022 01:03, Sean Christopherson wrote:
On Thu, Apr 07, 2022, Maciej S. Szmigiero wrote:
On 6.04.2022 22:52, Sean Christopherson wrote:
On Wed, Apr 06, 2022, Maciej S. Szmigiero wrote:
Another option for saving and restoring a VM would be to add it to
KVM_{GET,SET}_NESTED_STATE somewhere (maybe as a part of the saved VMCB12
control area?).

Ooh. What if we keep nested_run_pending=true until the injection completes? Then
we don't even need an extra flag because nested_run_pending effectively says that
any and all injected events are for L1=>L2. In KVM_GET_NESTED_STATE, shove the
to-be-injected event into the normal vmc*12 injection field, and ignore all
to-be-injected events in KVM_GET_VCPU_EVENTS if nested_run_pending=true.

That should work even for migrating to an older KVM, as keeping nested_run_pending
will cause the target to reprocess the event injection as if it were from nested
VM-Enter, which it technically is.

I guess here by "ignore all to-be-injected events in KVM_GET_VCPU_EVENTS" you
mean *moving* back the L1 -> L2 event to be injected from KVM internal data
structures like arch.nmi_injected (and so on) to the KVM_GET_NESTED_STATE-returned
VMCB12 EVENTINJ field (or its VMX equivalent).

But then the VMM will need to first call KVM_GET_NESTED_STATE (which will do
the moving), only then KVM_GET_VCPU_EVENTS (which will then no longer show
these events as pending).
And their setters in the opposite order when restoring the VM.

I wasn't thinking of actually moving things in the source VM, only ignoring events
in KVM_GET_VCPU_EVENTS. Getting state shouldn't be destructive, e.g. the source VM
should still be able to continue running.

Right, getters should not change state.

Ahahahaha, and actually looking at the code, there's this gem in KVM_GET_VCPU_EVENTS

/*
* The API doesn't provide the instruction length for software
* exceptions, so don't report them. As long as the guest RIP
* isn't advanced, we should expect to encounter the exception
* again.
*/
if (kvm_exception_is_soft(vcpu->arch.exception.nr)) {
events->exception.injected = 0;
events->exception.pending = 0;
}

and again for soft interrupts

events->interrupt.injected =
vcpu->arch.interrupt.injected && !vcpu->arch.interrupt.soft;

so through KVM's own incompetency, it's already doing half the work.

So KVM_GET_VCPU_EVENTS was basically already explicitly broken for this
case (where RIP does not point to a INT3/INTO/INT x instruction).

This is roughly what I had in mind. It will "require" moving nested_run_pending
to kvm_vcpu_arch, but I've been itching for an excuse to do that anyways.

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index eb71727acecb..62c48f6a0815 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -4846,6 +4846,8 @@ static int kvm_vcpu_ioctl_x86_set_mce(struct kvm_vcpu *vcpu,
static void kvm_vcpu_ioctl_x86_get_vcpu_events(struct kvm_vcpu *vcpu,
struct kvm_vcpu_events *events)
{
+ bool drop_injected_events = vcpu->arch.nested_run_pending;
+
process_nmi(vcpu);

if (kvm_check_request(KVM_REQ_SMI, vcpu))
@@ -4872,7 +4874,8 @@ static void kvm_vcpu_ioctl_x86_get_vcpu_events(struct kvm_vcpu *vcpu,
* isn't advanced, we should expect to encounter the exception
* again.
*/
- if (kvm_exception_is_soft(vcpu->arch.exception.nr)) {
+ if (drop_injected_events ||
+ kvm_exception_is_soft(vcpu->arch.exception.nr)) {
events->exception.injected = 0;
events->exception.pending = 0;
} else {
@@ -4893,13 +4896,14 @@ static void kvm_vcpu_ioctl_x86_get_vcpu_events(struct kvm_vcpu *vcpu,
events->exception_has_payload = vcpu->arch.exception.has_payload;
events->exception_payload = vcpu->arch.exception.payload;

- events->interrupt.injected =
- vcpu->arch.interrupt.injected && !vcpu->arch.interrupt.soft;
+ events->interrupt.injected = vcpu->arch.interrupt.injected &&
+ !vcpu->arch.interrupt.soft &&
+ !drop_injected_events;
events->interrupt.nr = vcpu->arch.interrupt.nr;
events->interrupt.soft = 0;
events->interrupt.shadow = static_call(kvm_x86_get_interrupt_shadow)(vcpu);

- events->nmi.injected = vcpu->arch.nmi_injected;
+ events->nmi.injected = vcpu->arch.nmi_injected && !drop_injected_events;
events->nmi.pending = vcpu->arch.nmi_pending != 0;
events->nmi.masked = static_call(kvm_x86_get_nmi_mask)(vcpu);
events->nmi.pad = 0;


So the VMM will get VMCB12 with EVENTINJ field filled with the event
to re-inject from KVM_GET_NESTED_STATE and events.{exception,interrupt,nmi}.injected
unset from KVM_GET_VCPU_EVENTS.

Let's say now that the VMM uses this data to restore a VM: it restores nested
state by using KVM_SET_NESTED_STATE and then events by calling KVM_SET_VCPU_EVENTS.

So it ends with VMCB12 EVENTINJ field filled, but KVM injection structures
(arch.{exception,interrupt,nmi}.injected) zeroed by that later KVM_SET_VCPU_EVENTS
call.

Assuming that L1 -> L2 event injection is always based on KVM injection structures
(like we discussed earlier), rather than on a direct copy of EVENTINJ field like
it is now, before doing the first VM entry KVM will need to re-parse VMCB12 EVENTINJ
field into KVM arch.{exception,interrupt,nmi}.injected to make it work properly.

Thanks,
Maciej