Re: [PATCH 1/5] KVM: nSVM: Sync next_rip field from vmcb12 to vmcb02

From: Maxim Levitsky
Date: Mon Apr 04 2022 - 05:50:59 EST


On Fri, 2022-04-01 at 21:51 +0000, Sean Christopherson wrote:
> On Fri, Apr 01, 2022, Maciej S. Szmigiero wrote:
> > On 1.04.2022 20:32, Sean Christopherson wrote:
> > > On Thu, Mar 10, 2022, Maciej S. Szmigiero wrote:
> > > > + /* The return address pushed on stack by the CPU for some injected events */
> > > > + svm->vmcb->control.next_rip = svm->nested.ctl.next_rip;
> > >
> > > This needs to be gated by nrips being enabled _and_ exposed to L1, i.e.
> > >
> > > if (svm->nrips_enabled)
> > > vmcb02->control.next_rip = svm->nested.ctl.next_rip;
> >
> > It can be done, however what if we run on a nrips-capable CPU,
> > but don't expose this capability to the L1?
>
> Oh, right, because the field will be populated by the CPU on VM-Exit. Ah, the
> correct behavior is to grab RIP from vmcb12 to emulate nrips=0 hardware simply
> not updating RIP. E.g. zeroing it out would send L2 into the weeds on IRET due
> the CPU pushing '0' on the stack when vectoring the injected event.
>
> if (svm->nrips_enabled)
> vmcb02->control.next_rip = svm->nested.ctl.next_rip;
> else if (boot_cpu_has(X86_FEATURE_NRIPS))
> vmcb02->control.next_rip = vmcb12_rip;
>
> > The CPU will then push whatever value was left in this field as
> > the return address for some L1 injected events.

This makes sense.

Note that even AMD's PRM has a note about this:

"
15.20 Event Injection
...
Software interrupts cannot be properly injected if the processor does not support the NextRIP field.
Support is indicated by CPUID Fn8000_000A_EDX[NRIPS] = 1. Hypervisor software should
emulate the event injection of software interrupts if NextRIP is not supported
"



> >
> > Although without nrips feature the L1 shouldn't even attempt event
> > injection, copying this field anyway will make it work if L1 just
> > expects this capability based on the current CPU model rather than
> > by checking specific CPUID feature bits.

The guest really ought to check CPUID bits. Plus the CPU model is also
usually virtualized (for named machine types in Qemu for example).

>
> L1 may still inject the exception, it just advances the RIP manually. As above,
> the really messy thing is that, because there's no flag to say "don't use NextRIP!",
> the CPU will still consume NextRIP and push '0' on the stack for the return RIP
> from the INTn/INT3/INTO. Yay.
>
> I found that out the hard way (patch in-progress). The way to handle event
> injection if KVM is loaded with nrips=0 but nrips is supported in hardware is to
> stuff NextRIP on event injection even if nrips=0, otherwise the guest is hosed.
>
> > > > + u64 next_rip;
> > > > u64 nested_cr3;
> > > > u64 virt_ext;
> > > > u32 clean;
> > >
> > > I don't know why this struct has
> > >
> > > u8 reserved_sw[32];
> > >
> > > but presumably it's for padding, i.e. probably should be reduced to 24 bytes.
> >
> > Apparently the "reserved_sw" field stores Hyper-V enlightenments state -
> > see commit 66c03a926f18 ("KVM: nSVM: Implement Enlightened MSR-Bitmap feature")
> > and nested_svm_vmrun_msrpm() in nested.c.
>
> Argh, that's a terrible name. Thanks for doing the homework, I was being lazy.


That was added around the commit

1183646a67d01 ("KVM: SVM: hyper-v: Direct Virtual Flush support")

Seems to be used by HV to store 'struct hv_enlightenments',
but I don't know 100% if that is the only thing that can be stored
in this area.



Best regards,
Maxim Levitsky


>