Re: [PATCH v7 8/8] KVM: VMX: enable IPI virtualization

From: Zeng Guang
Date: Fri Apr 08 2022 - 12:41:25 EST



On 4/5/2022 1:57 AM, Sean Christopherson wrote:
On Sun, Apr 03, 2022, Zeng Guang wrote:
On 4/1/2022 10:37 AM, Sean Christopherson wrote:
@@ -4219,14 +4226,21 @@ static void vmx_refresh_apicv_exec_ctrl(struct kvm_vcpu *vcpu)
pin_controls_set(vmx, vmx_pin_based_exec_ctrl(vmx));
if (cpu_has_secondary_exec_ctrls()) {
- if (kvm_vcpu_apicv_active(vcpu))
+ if (kvm_vcpu_apicv_active(vcpu)) {
secondary_exec_controls_setbit(vmx,
SECONDARY_EXEC_APIC_REGISTER_VIRT |
SECONDARY_EXEC_VIRTUAL_INTR_DELIVERY);
- else
+ if (enable_ipiv)
+ tertiary_exec_controls_setbit(vmx,
+ TERTIARY_EXEC_IPI_VIRT);
+ } else {
secondary_exec_controls_clearbit(vmx,
SECONDARY_EXEC_APIC_REGISTER_VIRT |
SECONDARY_EXEC_VIRTUAL_INTR_DELIVERY);
+ if (enable_ipiv)
+ tertiary_exec_controls_clearbit(vmx,
+ TERTIARY_EXEC_IPI_VIRT);
Oof. The existing code is kludgy. We should never reach this point without
enable_apicv=true, and enable_apicv should be forced off if APICv isn't supported,
let alone seconary exec being support.

Unless I'm missing something, throw a prep patch earlier in the series to drop
the cpu_has_secondary_exec_ctrls() check, that will clean this code up a smidge.
cpu_has_secondary_exec_ctrls() check can avoid wrong vmcs write in case mistaken
invocation.
KVM has far bigger problems on buggy invocation, and in that case the resulting
printk + WARN from the failed VMWRITE is a good thing.


SDM doesn't define VMWRITE failure for such case. But it says the logical
processor operates as if all the secondary processor-based VM-execution
controls were 0 if "activate secondary controls" primary processor-based
VM-execution control is 0. So we may add WARN() to detect this kind of
buggy invocation instead.

diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 61e075e16c19..6c370b507b45 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -4200,22 +4200,22 @@ static void vmx_refresh_apicv_exec_ctrl(struct kvm_vcpu *vcpu)
        struct vcpu_vmx *vmx = to_vmx(vcpu);

        pin_controls_set(vmx, vmx_pin_based_exec_ctrl(vmx));
-       if (cpu_has_secondary_exec_ctrls()) {
-               if (kvm_vcpu_apicv_active(vcpu)) {
-                       secondary_exec_controls_setbit(vmx,
- SECONDARY_EXEC_APIC_REGISTER_VIRT |
- SECONDARY_EXEC_VIRTUAL_INTR_DELIVERY);
-                       if (enable_ipiv)
-                               tertiary_exec_controls_setbit(vmx,
- TERTIARY_EXEC_IPI_VIRT);
-               } else {
-                       secondary_exec_controls_clearbit(vmx,
- SECONDARY_EXEC_APIC_REGISTER_VIRT |
- SECONDARY_EXEC_VIRTUAL_INTR_DELIVERY);
-                       if (enable_ipiv)
- tertiary_exec_controls_clearbit(vmx,
- TERTIARY_EXEC_IPI_VIRT);
-               }
+
+       WARN(!cpu_has_secondary_exec_ctrls(),
+                    "VMX: unexpected vmwrite with inactive secondary exec controls");
+
+       if (kvm_vcpu_apicv_active(vcpu)) {
+               secondary_exec_controls_setbit(vmx,
+                             SECONDARY_EXEC_APIC_REGISTER_VIRT |
+ SECONDARY_EXEC_VIRTUAL_INTR_DELIVERY);
+               if (enable_ipiv)
+                       tertiary_exec_controls_setbit(vmx, TERTIARY_EXEC_IPI_VIRT);
+       } else {
+               secondary_exec_controls_clearbit(vmx,
+                             SECONDARY_EXEC_APIC_REGISTER_VIRT |
+ SECONDARY_EXEC_VIRTUAL_INTR_DELIVERY);
+               if (enable_ipiv)
+                       tertiary_exec_controls_clearbit(vmx, TERTIARY_EXEC_IPI_VIRT);
        }

        vmx_update_msr_bitmap_x2apic(vcpu);


+ */
+ if (vmx_can_use_ipiv(vcpu->kvm)) {
+ struct kvm_vmx *kvm_vmx = to_kvm_vmx(vcpu->kvm);
+
+ mutex_lock(&vcpu->kvm->lock);
+ err = vmx_alloc_pid_table(kvm_vmx);
+ mutex_unlock(&vcpu->kvm->lock);
This belongs in vmx_vm_init(), doing it in vCPU creation is a remnant of the
dynamic resize approach that's no longer needed.
We cannot allocate pid table in vmx_vm_init() as userspace has no chance to
set max_vcpu_ids at this stage. That's the reason we do it in vCPU creation
instead.
Ah, right. Hrm. And that's going to be a recurring problem if we try to use the
dynamic kvm->max_vcpu_ids to reduce other kernel allocations.

Argh, and even kvm_arch_vcpu_precreate() isn't protected by kvm->lock.

Taking kvm->lock isn't problematic per se, I just hate doing it so deep in a
per-vCPU flow like this.

A really gross hack/idea would be to make this 64-bit only and steal the upper
32 bits of @type in kvm_create_vm() for the max ID.

I think my first choice would be to move kvm_arch_vcpu_precreate() under kvm->lock.
None of the architectures that have a non-nop implemenation (s390, arm64 and x86)
do significant work, so holding kvm->lock shouldn't harm performance. s390 has to
acquire kvm->lock in its implementation, so we could drop that. And looking at
arm64, I believe its logic should also be done under kvm->lock.

It'll mean adding yet another kvm_x86_ops, but I like that more than burying the
code deep in vCPU creation.

Paolo, any thoughts on this?

Sounds reasonable. I will prepare patch to refactor the kvm_arch_vcpu_precreate()
and make pid table allocation done there.

Thanks.