Re: [PATCH RFC] KVM: x86: vmx: throttle immediate exit through preemption timer to assist buggy guests

From: Vitaly Kuznetsov
Date: Fri Mar 29 2019 - 05:14:52 EST


Liran Alon <liran.alon@xxxxxxxxxx> writes:

>> On 28 Mar 2019, at 22:31, Vitaly Kuznetsov <vkuznets@xxxxxxxxxx> wrote:
>>
>> This is embarrassing, but we have another Windows/Hyper-V issue to work
>> around in KVM (or QEMU). Hope "RFC" makes it less offensive.
>>
>> It was noticed that a Hyper-V guest on a q35 KVM/QEMU VM hangs on boot if
>> e.g. a 'piix4-usb-uhci' device is attached. The problem with this device
>> is that it uses level-triggered interrupts.
>>
>> The 'hang' scenario develops like this:
>> 1) Hyper-V boots and QEMU is trying to inject two irqs simultaneously. One
>> of them is level-triggered. KVM injects the edge-triggered one and
>> requests an immediate exit to inject the level-triggered one:
>>
>> kvm_set_irq: gsi 23 level 1 source 0
>> kvm_msi_set_irq: dst 0 vec 80 (Fixed|physical|level)
>> kvm_apic_accept_irq: apicid 0 vec 80 (Fixed|edge)
>> kvm_msi_set_irq: dst 0 vec 96 (Fixed|physical|edge)
>> kvm_apic_accept_irq: apicid 0 vec 96 (Fixed|edge)
>> kvm_nested_vmexit_inject: reason EXTERNAL_INTERRUPT info1 0 info2 0 int_info 80000060 int_info_err 0
>
> There is no immediate-exit here.
> Here QEMU just set two pending irqs: vector 80 and vector 96.
> Because vCPU 0 is running at non-root-mode, KVM emulates an exit from L2 to L1 on EXTERNAL_INTERRUPT.
> Note that EXTERNAL_INTERRUPT is emulated with vector 0x60==96, which is the highest pending vector, as expected.
>
> BTW, I don't know why both are set in the LAPIC as edge-triggered and not level-triggered.
> But it can be seen from the trace pattern that these interrupts are both level-triggered. (See QEMU's ioapic_service().)
> How did you deduce that one is edge-triggered and the other is
> level-triggered?

"kvm_apic_accept_irq" event misreports level-triggered interrupts as
edge-triggered, see my "KVM: x86: avoid misreporting level-triggered
irqs as edge-triggered in tracing" patch.

Other than that, I double-checked both in QEMU and KVM (a lot of
additional debug prints were stripped) and I'm certain there's no
disagreement anywhere: gsi 23/vec 80 is a level-triggered interrupt.
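
For reference, the gist of the problem that patch addresses, as far as
I can reconstruct it here (an illustrative sketch, not a verbatim quote
of the patch): the tracepoint declares the trigger-mode argument as
__u8, while for MSI-delivered interrupts the value passed in is the raw
bit 15 of the MSI data, so it gets truncated to 0 and always decodes as
"edge":

/* arch/x86/kvm/trace.h, illustrative excerpt */
TRACE_EVENT(kvm_apic_accept_irq,
	    TP_PROTO(__u32 apicid, __u16 dm, __u8 tm, __u8 vec),
	    ...
);

/*
 * __apic_accept_irq() passes trig_mode, which for a level-triggered
 * MSI is (1 << 15); truncated to __u8 it becomes 0, i.e. "edge".
 * Widening 'tm' (e.g. to __u16) preserves the bit.
 */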

>
>>
>> 2) Hyper-V requires one of its VMs to run to handle the situation, but an
>> immediate exit happens:
>>
>> kvm_entry: vcpu 0
>> kvm_exit: reason VMRESUME rip 0xfffff80006a40115 info 0 0
>> kvm_entry: vcpu 0
>> kvm_exit: reason PREEMPTION_TIMER rip 0xfffff8022f3d8350 info 0 0
>> kvm_nested_vmexit: rip fffff8022f3d8350 reason PREEMPTION_TIMER info1 0 info2 0 int_info 0 int_info_err 0
>> kvm_nested_vmexit_inject: reason EXTERNAL_INTERRUPT info1 0 info2 0 int_info 80000050 int_info_err 0
>
> I assume that, as part of Hyper-V's VMExit handler for EXTERNAL_INTERRUPT, it will forward the interrupt to the host.
> As done in KVM's vcpu_enter_guest() calling kvm_x86_ops->handle_external_intr().
> Because vmcs->vm_exit_intr_info specifies vector 96, we are still left with vector 80 pending.
>
> I also assume that Hyper-V utilises VM_EXIT_ACK_INTR_ON_EXIT, and thus vector 96 is cleared from the LAPIC IRR
> and the bit in the LAPIC ISR for vector 96 is set.
> This is emulated by L0 KVM at nested_vmx_vmexit() -> kvm_cpu_get_interrupt().
>
> I further assume that at the point vector 96 runs in L1, interrupts are disabled.
> Afterwards I would expect L1 to enable interrupts (similar to vcpu_enter_guest() calling local_irq_enable() after kvm_x86_ops->handle_external_intr()).
>
> I would expect Hyper-V's handler for vector 96 at some point to do an EOI such that, when interrupts are later enabled, vector 80 also gets injected.
> All of this before attempting to resume back into L2.
>
> However, it can be seen that at this resume you indeed receive, after an immediate exit, an injection of EXTERNAL_INTERRUPT on vector 0x50==80.
> As if Hyper-V never enabled interrupts after handling vector 96 before resuming L2 again.
>
> This is still valid of course but just a bit bizarre and
> inefficient. Oh well. :)

Reverse-engineering is always fun :-)
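
For readers following along, the vcpu_enter_guest() sequence Liran
refers to looks roughly like this (a simplified excerpt from memory,
not verbatim):

/* arch/x86/kvm/x86.c, simplified */
static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
{
	...
	local_irq_disable();
	...
	kvm_x86_ops->run(vcpu);	/* VM entry / VM exit */
	...
	/* dispatch the host vector from VM_EXIT_INTR_INFO, IRQs still off */
	kvm_x86_ops->handle_external_intr(vcpu);
	...
	local_irq_enable();	/* next pending vector can fire here */
	...
}

Presumably L1 Hyper-V has an equivalent sequence and vector 80 should
be delivered as soon as it re-enables interrupts.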

>
>>
>> 3) Hyper-V doesn't want to deal with the second irq (as its VM still hasn't
>> processed the first one)
>
> Both interrupts are for L1, not L2.
>
>> so it just does 'EOI' for the level-triggered interrupt
>> and this causes re-injection:
>>
>> kvm_exit: reason EOI_INDUCED rip 0xfffff80006a17e1a info 50 0
>> kvm_eoi: apicid 0 vector 80
>> kvm_userspace_exit: reason KVM_EXIT_IOAPIC_EOI (26)
>> kvm_set_irq: gsi 23 level 1 source 0
>> kvm_msi_set_irq: dst 0 vec 80 (Fixed|physical|level)
>> kvm_apic_accept_irq: apicid 0 vec 80 (Fixed|edge)
>> kvm_entry: vcpu 0
>
> What happens here is that Hyper-V, in response to the second EXTERNAL_INTERRUPT on vector 80,
> invokes the vector 80 handler, which performs an EOI; that EOI is configured in ioapic_exit_bitmap to cause an EOI_INDUCED exit to L0.
> The EOI_INDUCED handler will reach handle_apic_eoi_induced() -> kvm_apic_set_eoi_accelerated() -> kvm_ioapic_send_eoi() -> kvm_make_request(KVM_REQ_IOAPIC_EOI_EXIT),
> which will cause the exit on KVM_EXIT_IOAPIC_EOI to QEMU as required.
>
> As part of QEMU's handling of this exit (ioapic_eoi_broadcast()), it will note that the pin's irr is still set (the irq-line was not lowered by the vector 80 interrupt handler before EOI),
> and thus vector 80 is re-injected by the IOAPIC at ioapic_service().
>
> If this is indeed a level-triggered interrupt, then it seems buggy to me that the vector 80 handler hasn't lowered the irq-line before EOI.
> I would suggest adding a trace to QEMU's ioapic_set_irq() for when vector=80 and level=0, and to ioapic_eoi_broadcast(), to verify that this is what happens.
> Another option is that even though the vector 80 handler lowered the irq-line, the line got re-set by the device between when it was lowered and the EOI. Which is legit.
>
> If the former is what happens and the vector 80 handler hasn't lowered
> the irq-line before EOI, this is the real root-cause of the issue at
> hand here.

Yes, exactly - and sorry if I wasn't clear about that. Hyper-V does an
EOI for a still-pending level-triggered interrupt (as it can't actually
fix the reason - it needs the Windows partition - the L2 VM - to run
and do the job).

As pure speculation, I can guess: they don't want to keep a queue of
interrupts which need L2 handling, and they count on the fact that doing
an EOI with interrupts disabled in L1 -> VMRESUME will actually allow
the L2 VM to run for some time and make some progress.
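
BTW, the instrumentation Liran suggests could be a couple of ad-hoc
prints in QEMU's hw/intc/ioapic.c, something like the sketch below (the
function signatures match QEMU; the prints and the hardcoded pin/vector
for this reproducer are mine):

/* note: ioapic_set_irq()'s 'vector' parameter is really the pin number */
static void ioapic_set_irq(void *opaque, int vector, int level)
{
    if (vector == 23 && !level) {
        fprintf(stderr, "ioapic: line for pin 23 lowered\n");
    }
    /* ... original delivery logic ... */
}

void ioapic_eoi_broadcast(int vector)
{
    if (vector == 0x50) {
        fprintf(stderr, "ioapic: EOI broadcast for vector 0x50\n");
    }
    /* ... original EOI handling ... */
}

If the "lowered" print never shows up between the injection and the
EOI, the guest handler indeed EOIs without lowering the line.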

>
>>
>> 4) As we arm the preemption timer again, we get stuck in an infinite loop:
>>
>> kvm_exit: reason VMRESUME rip 0xfffff80006a40115 info 0 0
>> kvm_entry: vcpu 0
>> kvm_exit: reason PREEMPTION_TIMER rip 0xfffff8022f3d8350 info 0 0
>> kvm_nested_vmexit: rip fffff8022f3d8350 reason PREEMPTION_TIMER info1 0 info2 0 int_info 0 int_info_err 0
>> kvm_nested_vmexit_inject: reason EXTERNAL_INTERRUPT info1 0 info2 0 int_info 80000050 int_info_err 0
>> kvm_entry: vcpu 0
>> kvm_exit: reason EOI_INDUCED rip 0xfffff80006a17e1a info 50 0
>> kvm_eoi: apicid 0 vector 80
>> kvm_userspace_exit: reason KVM_EXIT_IOAPIC_EOI (26)
>> kvm_set_irq: gsi 23 level 1 source 0
>> kvm_msi_set_irq: dst 0 vec 80 (Fixed|physical|level)
>> kvm_apic_accept_irq: apicid 0 vec 80 (Fixed|edge)
>> kvm_entry: vcpu 0
>> kvm_exit: reason VMRESUME rip 0xfffff80006a40115 info 0 0
>> ...
>>
>> How does this work on hardware? I don't really know, but it seems that we
>> were dealing with similar issues before, see commit 184564efae4d7 ("kvm:
>> ioapic: conditionally delay irq delivery during eoi broadcast"). In case
>> EOI doesn't always cause an *immediate* interrupt re-injection, some
>> progress can be made.
>
> This commit tries to handle a bug exactly like the one I have described above.
> In case a level-triggered interrupt handler in the guest performs an EOI before it lowers the irq-line, the level-triggered interrupt will be re-injected in a loop forever.
> The reason this doesn't happen on real hardware is that there is a very short delay between when an EOI is performed to the IOAPIC and when the IOAPIC attempts to serve a new interrupt.
>
> To handle this, the commit delays the evaluation of new pending IOAPIC interrupts when it receives an EOI for a level-triggered interrupt.
>
>>
>> An alternative approach would be to port the above-mentioned commit to
>> QEMU's ioapic implementation. I'm, however, not sure that level-triggered
>> interrupts are a must to trigger the issue.
>
> This seems like the right solution to me.
> It seems to me that this analysis has actually pinpointed the issue
> directly to level-triggered interrupts, and that they are a must to
> trigger it.

Yes, in this particular case it is directly linked to (mis-)handling of
a level-triggered interrupt.

>
> This also explains why the issue doesn't reproduce when you are using the in-kernel irqchip.
> It also explains why 'piix4-usb-uhci' must be present to reproduce
> this issue, as the bug is probably in that device's interrupt handler.

Level-triggered interrupts are rather rare but still present; I think
that any device using them can cause a similar hang.

> (We should probably report this bug to Microsoft. Do we have any
> contacts for this?)

Yes, I just wanted to get KVM people's opinion first.

>
> Do you agree with above analysis?

Thank you for it! I think we're in violent agreement here :-)

> -Liran
>
>>
>> HV_TIMER_THROTTLE/HV_TIMER_DELAY are more or less arbitrary. I haven't
>> looked at SVM yet, but their immediate exit implementation does
>> smp_send_reschedule(); I'm not sure whether this can cause the
>> above-described lockup.
>>
>> Signed-off-by: Vitaly Kuznetsov <vkuznets@xxxxxxxxxx>
>
> As mentioned above, I think this patch should be dropped.
>

Yes, of course, I never expected it to be merged. I am, however, afraid
that we'll have to get back to this discussion if another scenario with
an 'immediate EOI storm' appears. I can't think of anything in
particular right away, so I'm going to implement the throttling for
level-triggered interrupt re-assertion in QEMU first.
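
To make that plan concrete: the QEMU-side throttle could mirror the
idea of kernel commit 184564efae4d7 - count successive EOI-induced
re-assertions of a level-triggered pin and, past a threshold, re-inject
through a timer instead of immediately. A rough sketch only; the field
and helper names and the thresholds below are made up:

/* hw/intc/ioapic.c, sketch */
#define SUCCESSIVE_IRQ_MAX_COUNT 10000
#define IRQ_REINJECT_DELAY_NS    100000	/* 100 us */

/* timer callback: deliver the postponed re-injection */
static void ioapic_delayed_service(void *opaque)
{
    IOAPICCommonState *s = opaque;

    ioapic_service(s);
}

/* called from ioapic_eoi_broadcast() when the pin's irr is still set */
static void ioapic_reinject(IOAPICCommonState *s, int pin)
{
    if (++s->irq_eoi_count[pin] > SUCCESSIVE_IRQ_MAX_COUNT) {
        /* too many immediate re-injections in a row: delay this one */
        timer_mod(s->delayed_service_timer,
                  qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL) +
                  IRQ_REINJECT_DELAY_NS);
    } else {
        ioapic_service(s);
    }
}

(irq_eoi_count would be reset whenever the line is actually lowered.)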

>> ---
>> arch/x86/kvm/vmx/vmcs.h | 2 ++
>> arch/x86/kvm/vmx/vmx.c | 24 +++++++++++++++++++++++-
>> 2 files changed, 25 insertions(+), 1 deletion(-)
>>
>> diff --git a/arch/x86/kvm/vmx/vmcs.h b/arch/x86/kvm/vmx/vmcs.h
>> index cb6079f8a227..983dfc60c30c 100644
>> --- a/arch/x86/kvm/vmx/vmcs.h
>> +++ b/arch/x86/kvm/vmx/vmcs.h
>> @@ -54,6 +54,8 @@ struct loaded_vmcs {
>> bool launched;
>> bool nmi_known_unmasked;
>> bool hv_timer_armed;
>> + unsigned long hv_timer_lastrip;
>> + int hv_timer_lastrip_cnt;
>> /* Support for vnmi-less CPUs */
>> int soft_vnmi_blocked;
>> ktime_t entry_time;
>> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
>> index 617443c1c6d7..8a49ec14dd3a 100644
>> --- a/arch/x86/kvm/vmx/vmx.c
>> +++ b/arch/x86/kvm/vmx/vmx.c
>> @@ -6321,14 +6321,36 @@ static void vmx_arm_hv_timer(struct vcpu_vmx *vmx, u32 val)
>> vmx->loaded_vmcs->hv_timer_armed = true;
>> }
>>
>> +#define HV_TIMER_THROTTLE 100
>> +#define HV_TIMER_DELAY 1000
>> +
>> static void vmx_update_hv_timer(struct kvm_vcpu *vcpu)
>> {
>> struct vcpu_vmx *vmx = to_vmx(vcpu);
>> u64 tscl;
>> u32 delta_tsc;
>>
>> + /*
>> + * Some guests are buggy and may behave in a way that causes KVM to
>> + * always request immediate exit (e.g. of a nested guest). At the same
>> + * time guest progress may be required to resolve the condition.
>> + * Throttling below makes sure we never get stuck completely.
>> + */
>> if (vmx->req_immediate_exit) {
>> - vmx_arm_hv_timer(vmx, 0);
>> + unsigned long rip = kvm_rip_read(vcpu);
>> +
>> + if (vmx->loaded_vmcs->hv_timer_lastrip == rip) {
>> + ++vmx->loaded_vmcs->hv_timer_lastrip_cnt;
>> + } else {
>> + vmx->loaded_vmcs->hv_timer_lastrip_cnt = 0;
>> + vmx->loaded_vmcs->hv_timer_lastrip = rip;
>> + }
>> +
>> + if (vmx->loaded_vmcs->hv_timer_lastrip_cnt > HV_TIMER_THROTTLE)
>> + vmx_arm_hv_timer(vmx, HV_TIMER_DELAY);
>> + else
>> + vmx_arm_hv_timer(vmx, 0);
>> +
>> return;
>> }
>>
>> --
>> 2.20.1
>>
>

--
Vitaly