Re: [PATCH v2 10/11] KVM: SVM: implement support for vNMI

From: Santosh Shukla
Date: Fri Feb 10 2023 - 07:24:38 EST


On 2/1/2023 5:52 AM, Sean Christopherson wrote:
> On Tue, Nov 29, 2022, Maxim Levitsky wrote:
>> This patch implements support for injecting pending
>> NMIs via the .kvm_x86_set_hw_nmi_pending using new AMD's vNMI
>
> Wrap closer to 75 chars, and please try to be consistent in how your format
> things like changelogs and comments. It's jarring as a reader when the wrap
> column is constantly changing.
>
>> feature.
>
> Please combine the introduction, usage, and implementation of the new kvm_x86_ops,
> i.e. introduce and use the ops in this patch. It's extremely difficult to review
> the common x86 code that uses the ops without seeing how they're implemented in
> SVM. I believe the overall size/scope of the patch can be kept reasonable by
> introducing some of the common changes in advance of the new ops, e.g. tweaking
> the KVM_SET_VCPU_EVENTS flow.
>
>> Note that the vNMI can't cause a VM exit, which is needed
>> when a nested guest intercepts NMIs.
>
> I can't tell if this is saying "SVM doesn't allow intercepting virtual NMIs", or
> "KVM never enables virtual NMI interception".
>

I think it meant to say that vNMI doesn't need the nmi_window_exiting feature in
order to pend a new virtual NMI. Will reword.

>> Therefore to avoid breaking nesting, the vNMI is inhibited while
>> a nested guest is running and instead, the legacy NMI window
>> detection and delivery method is used.
>
> State what KVM does, don't describe the effects. E.g. "Disable vNMI while running
> L2". When a changelog describes the effects, it's unclear whether the effects are
> new behavior introduced by the patch, hardware behavior, etc...
>
>> While it is possible to passthrough the vNMI if a nested guest
>> doesn't intercept NMIs, such usage is very uncommon, and it's
>> not worth to optimize for.
>
> Can you elaborate on why not? It's not obvious to me that the code will end up
> more complex, and oftentimes omitting code increases the cognitive load on readers,
> i.e. makes things more complex in a way. vNMI is mutually exclusive with NMI
> passthrough, i.e. KVM doesn't need to handle NMI passthrough and vNMI simultaneously.
>
>> Signed-off-by: Santosh Shukla <santosh.shukla@xxxxxxx>
>> Signed-off-by: Maxim Levitsky <mlevitsk@xxxxxxxxxx>
>
> SoB chain is wrong. Maxim is credited as the sole Author, i.e. Santosh shouldn't
> have a SoB. Assuming the intent is to attribute both of ya'll this needs to be
>
> Co-developed-by: Santosh Shukla <santosh.shukla@xxxxxxx>
> Signed-off-by: Santosh Shukla <santosh.shukla@xxxxxxx>
> Signed-off-by: Maxim Levitsky <mlevitsk@xxxxxxxxxx>
>
> if Maxim is the primary author, or this if Santosh is the primary author
>
> From: Santosh Shukla <santosh.shukla@xxxxxxx>
>
> <blah blah blah>
>
> Signed-off-by: Santosh Shukla <santosh.shukla@xxxxxxx>
> Developed-by: Maxim Levitsky <mlevitsk@xxxxxxxxxx>
> Signed-off-by: Maxim Levitsky <mlevitsk@xxxxxxxxxx>
>

Will sort those out in v3.

>> ---
>> arch/x86/kvm/svm/nested.c | 42 +++++++++++++++
>> arch/x86/kvm/svm/svm.c | 111 ++++++++++++++++++++++++++++++--------
>> arch/x86/kvm/svm/svm.h | 10 ++++
>> 3 files changed, 140 insertions(+), 23 deletions(-)
>>
>> diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c
>> index e891318595113e..5bea672bf8b12d 100644
>> --- a/arch/x86/kvm/svm/nested.c
>> +++ b/arch/x86/kvm/svm/nested.c
>> @@ -623,6 +623,42 @@ static bool is_evtinj_nmi(u32 evtinj)
>> return type == SVM_EVTINJ_TYPE_NMI;
>> }
>>
>> +static void nested_svm_save_vnmi(struct vcpu_svm *svm)
>
> Please avoid save/restore names. KVM (selftests in particular) uses save/restore
> to refer to a saving and restoring state across a migration. "sync" is probably
> the best option, or just open code the flows.
>

ok.

I chose to open code it; that way I don't need to consider using svm->nmi_masked, which
should only be used for the non-vNMI case.

>> +{
>> + struct vmcb *vmcb01 = svm->vmcb01.ptr;
>> +
>> + /*
>> + * Copy the vNMI state back to software NMI tracking state
>> + * for the duration of the nested run
>> + */
>> +
>> + svm->nmi_masked = vmcb01->control.int_ctl & V_NMI_MASK;
>> + svm->vcpu.arch.nmi_pending += vmcb01->control.int_ctl & V_NMI_PENDING;
>
> This is wrong. V_NMI_PENDING is bit 11, i.e. the bitwise-AND does not yield a
> boolean value and will increment nmi_pending by 2048 instead of by 1.
>

Right.

> if (vmcb01->control.int_ctl & V_NMI_PENDING)
> svm->vcpu.arch.nmi_pending++;
>
> And this needs a KVM_REQ_EVENT to ensure KVM processes the newly pending NMI.
>

Ok.
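
For v3 I plan to open code this in the nested entry path, roughly along the lines
below (just a sketch, using this series' field names, with the KVM_REQ_EVENT you
mention):

	svm->nmi_masked = !!(vmcb01->control.int_ctl & V_NMI_MASK);

	if (vmcb01->control.int_ctl & V_NMI_PENDING) {
		svm->vcpu.arch.nmi_pending++;
		kvm_make_request(KVM_REQ_EVENT, &svm->vcpu);
	}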

>> +}
>> +
>> +static void nested_svm_restore_vnmi(struct vcpu_svm *svm)
>> +{
>> + struct kvm_vcpu *vcpu = &svm->vcpu;
>> + struct vmcb *vmcb01 = svm->vmcb01.ptr;
>> +
>> + /*
>> + * Restore the vNMI state from the software NMI tracking state
>> + * after a nested run
>> + */
>> +
>> + if (svm->nmi_masked)
>> + vmcb01->control.int_ctl |= V_NMI_MASK;
>> + else
>> + vmcb01->control.int_ctl &= ~V_NMI_MASK;
>
> As proposed, this needs to clear nmi_masked to avoid false positives. The better
> approach is to not have any open coded accesses to svm->nmi_masked outside of
> flows that specifically need to deal with vNMI logic.
>
ok.

> E.g. svm_enable_nmi_window() reads the raw nmi_masked.
>
>> +
>> + if (vcpu->arch.nmi_pending) {
>> + vcpu->arch.nmi_pending--;
>> + vmcb01->control.int_ctl |= V_NMI_PENDING;
>> + } else
>
> Curly braces on all paths if any path needs 'em.
>

ok.

>> + vmcb01->control.int_ctl &= ~V_NMI_PENDING;
>> +}
>
> ...
>
>> + static bool svm_set_hw_nmi_pending(struct kvm_vcpu *vcpu)
>> +{
>> + struct vcpu_svm *svm = to_svm(vcpu);
>> +
>> + if (!is_vnmi_enabled(svm))
>> + return false;
>> +
>> + if (svm->vmcb->control.int_ctl & V_NMI_PENDING)
>> + return false;
>> +
>> + svm->vmcb->control.int_ctl |= V_NMI_PENDING;
>> + vmcb_mark_dirty(svm->vmcb, VMCB_INTR);
>> +
>> + /*
>> + * NMI isn't yet technically injected but
>> + * this rough estimation should be good enough
>
> Again, wrap at 80 chars, not at random points.
>

ok.

>> + */
>> + ++vcpu->stat.nmi_injections;
>> +
>> + return true;
>> +}
>> +
>
> ...
>
>> bool svm_interrupt_blocked(struct kvm_vcpu *vcpu)
>> {
>> struct vcpu_svm *svm = to_svm(vcpu);
>> @@ -3725,10 +3772,16 @@ static void svm_enable_nmi_window(struct kvm_vcpu *vcpu)
>> /*
>> * Something prevents NMI from been injected. Single step over possible
>> * problem (IRET or exception injection or interrupt shadow)
>> + *
>> + * With vNMI we should never need an NMI window
>> + * (we can always inject vNMI either by setting VNMI_PENDING or by EVENTINJ)
>
> Please honor the soft limit and avoid pronouns. There's also no need to put the
> blurb in parantheses on its own line.
>
> As for the code, I believe there are bugs. Pulling in the context...
>
> if (svm->nmi_masked && !svm->awaiting_iret_completion)
> return; /* IRET will cause a vm exit */
>
> Checking nmi_masked is wrong, this should use the helper. Even if this code can

Right.

> only be reached on error, it should still try its best to not make things worse.
>
> if (!gif_set(svm)) {
> if (vgif)
> svm_set_intercept(svm, INTERCEPT_STGI);
> return; /* STGI will cause a vm exit */
> }
>
> /*
> * Something prevents NMI from been injected. Single step over possible
> * problem (IRET or exception injection or interrupt shadow)
> *
> * With vNMI we should never need an NMI window
> * (we can always inject vNMI either by setting VNMI_PENDING or by EVENTINJ)
> */
> if (WARN_ON_ONCE(is_vnmi_enabled(svm)))
> return;
>
> This is flawed, where "this" means handling of NMI windows when vNMI is enabled.
>
> IIUC, if there are back-to-back NMIs, the intent is to set V_NMI for one and
> inject the other. I believe there are multiple bugs svm_inject_nmi(). The one
> that's definitely a bug is setting svm->nmi_masked. The other suspected bug,
> which is related to the above WARN, is setting the IRET intercept. The resulting
> IRET interception will set awaiting_iret_completion, and then the above WARN is
> reachable (even if the masking check is fixed).
>
> I don't think KVM needs to ever intercept IRET. One NMI gets injected, and the
> other is sitting in INT_CTL.V_NMI_PENDING, i.e. there's no need for KVM to regain
> control. If another NMI comes along before V_NMI_PENDING is handled, it will
> either get injected or dropped.
>
> So I _think_ KVM can skip the intercept code when injecting an NMI, and this WARN
> can be hoisted to the top of svm_enable_nmi_window(), because as stated above, KVM
> should never need to request an NMI window.
>
> Last thought, unless there's something that will obviously break, it's probably
> better to WARN and continue than to bail. I.e. do the single-step and hope for
> the best. Bailing at this point doesn't seem like it would help.
>
>> */
>> + if (WARN_ON_ONCE(is_vnmi_enabled(svm)))
>> + return;
>> +
>> svm->nmi_singlestep_guest_rflags = svm_get_rflags(vcpu);
>> - svm->nmi_singlestep = true;
>> svm->vmcb->save.rflags |= (X86_EFLAGS_TF | X86_EFLAGS_RF);
>> + svm->nmi_singlestep = true;
>> }
>

Ok.

So you mean: in vNMI mode, KVM should never need to request an NMI window, and if it
ever does reach svm_enable_nmi_window() then WARN and continue to single step. The
modified code might look something like below:

static void svm_enable_nmi_window(struct kvm_vcpu *vcpu)
{
	struct vcpu_svm *svm = to_svm(vcpu);

	/*
	 * With vNMI we should never need an NMI window; if we reach here
	 * anyway, WARN and continue to single step.
	 */
	WARN_ON_ONCE(is_vnmi_enabled(svm));

	if (svm_get_nmi_mask(vcpu) && !svm->awaiting_iret_completion)
		return; /* IRET will cause a vm exit */

	if (!gif_set(svm)) {
		if (vgif)
			svm_set_intercept(svm, INTERCEPT_STGI);
		return; /* STGI will cause a vm exit */
	}

	/*
	 * Something prevents NMI from been injected. Single step over possible
	 * problem (IRET or exception injection or interrupt shadow)
	 */

	svm->nmi_singlestep_guest_rflags = svm_get_rflags(vcpu);
	svm->vmcb->save.rflags |= (X86_EFLAGS_TF | X86_EFLAGS_RF);
	svm->nmi_singlestep = true;
}
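
And then, if I read your suggestion correctly, svm_inject_nmi() would stop setting
nmi_masked and would not intercept IRET when vNMI is enabled - roughly along these
lines (untested sketch; exact helper names may differ after the rework):

static void svm_inject_nmi(struct kvm_vcpu *vcpu)
{
	struct vcpu_svm *svm = to_svm(vcpu);

	svm->vmcb->control.event_inj = SVM_EVTINJ_VALID | SVM_EVTINJ_TYPE_NMI;

	if (svm->nmi_l1_to_l2)
		return;

	/*
	 * With vNMI the masked state is tracked by the CPU in V_NMI_MASK,
	 * so there is no need to set nmi_masked or to intercept IRET just
	 * to learn when the guest finishes handling the NMI.
	 */
	if (!is_vnmi_enabled(svm)) {
		svm->nmi_masked = true;
		if (!sev_es_guest(vcpu->kvm))
			svm_set_intercept(svm, INTERCEPT_IRET);
	}

	++vcpu->stat.nmi_injections;
}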

Does that make sense?

Thanks,
Santosh

> ...
>
>> @@ -553,6 +554,15 @@ static inline bool is_x2apic_msrpm_offset(u32 offset)
>> (msr < (APIC_BASE_MSR + 0x100));
>> }
>>
>> +static inline bool is_vnmi_enabled(struct vcpu_svm *svm)
>> +{
>> + /* L1's vNMI is inhibited while nested guest is running */
>> + if (is_guest_mode(&svm->vcpu))
>
> I would rather check the current VMCB. I don't see any value in hardcoding the
> "KVM doesn't support vNMI in L2" in multiple places. And I find the above comment
> about "L1's vNMI is inhibited" confusing. vNMI isn't inhibited/blocked, KVM just
> doesn't utilize vNMI while L2 is active (IIUC, as proposed).
>
>> + return false;
>> +
>> + return !!(svm->vmcb01.ptr->control.int_ctl & V_NMI_ENABLE);
>> +}
>> +
>> /* svm.c */
>> #define MSR_INVALID 0xffffffffU
>>
>> --
>> 2.26.3
>>