Re: [PATCH v2 0/5] KVM: Fix oneshot interrupts forwarding

From: Dmytro Maluka
Date: Wed Aug 10 2022 - 13:02:55 EST


Hi Marc,

On 8/10/22 3:01 PM, Marc Zyngier wrote:
> On Wed, 10 Aug 2022 09:12:18 +0100,
> Eric Auger <eric.auger@xxxxxxxxxx> wrote:
>>
>> Hi Marc,
>>
>> On 8/10/22 08:51, Marc Zyngier wrote:
>>> On Wed, 10 Aug 2022 00:30:29 +0100,
>>> Dmytro Maluka <dmy@xxxxxxxxxxxx> wrote:
>>>> On 8/9/22 10:01 PM, Dong, Eddie wrote:
>>>>>
>>>>>> -----Original Message-----
>>>>>> From: Dmytro Maluka <dmy@xxxxxxxxxxxx>
>>>>>> Sent: Tuesday, August 9, 2022 12:24 AM
>>>>>> To: Dong, Eddie <eddie.dong@xxxxxxxxx>; Christopherson,, Sean
>>>>>> <seanjc@xxxxxxxxxx>; Paolo Bonzini <pbonzini@xxxxxxxxxx>;
>>>>>> kvm@xxxxxxxxxxxxxxx
>>>>>> Cc: Thomas Gleixner <tglx@xxxxxxxxxxxxx>; Ingo Molnar <mingo@xxxxxxxxxx>;
>>>>>> Borislav Petkov <bp@xxxxxxxxx>; Dave Hansen <dave.hansen@xxxxxxxxxxxxxxx>;
>>>>>> x86@xxxxxxxxxx; H. Peter Anvin <hpa@xxxxxxxxx>; linux-
>>>>>> kernel@xxxxxxxxxxxxxxx; Eric Auger <eric.auger@xxxxxxxxxx>; Alex
>>>>>> Williamson <alex.williamson@xxxxxxxxxx>; Liu, Rong L <rong.l.liu@xxxxxxxxx>;
>>>>>> Zhenyu Wang <zhenyuw@xxxxxxxxxxxxxxx>; Tomasz Nowicki
>>>>>> <tn@xxxxxxxxxxxx>; Grzegorz Jaszczyk <jaz@xxxxxxxxxxxx>;
>>>>>> upstream@xxxxxxxxxxxx; Dmitry Torokhov <dtor@xxxxxxxxxx>
>>>>>> Subject: Re: [PATCH v2 0/5] KVM: Fix oneshot interrupts forwarding
>>>>>>
>>>>>> On 8/9/22 1:26 AM, Dong, Eddie wrote:
>>>>>>>> The existing KVM mechanism for forwarding of level-triggered
>>>>>>>> interrupts using resample eventfd doesn't work quite correctly in the
>>>>>>>> case of interrupts that are handled in a Linux guest as oneshot
>>>>>>>> interrupts (IRQF_ONESHOT). Such an interrupt is acked to the device
>>>>>>>> in its threaded irq handler, i.e. later than it is acked to the
>>>>>>>> interrupt controller (EOI at the end of hardirq), not earlier. The
>>>>>>>> existing KVM code doesn't take that into account, which results in
>>>>>>>> erroneous extra interrupts in the guest caused by premature re-assert of an
>>>>>> unacknowledged IRQ by the host.
>>>>>>> Interesting... How does it behave on the native side?
>>>>>> On the native side it behaves correctly, since Linux masks such a oneshot interrupt at the
>>>>>> beginning of hardirq, so that the EOI at the end of hardirq doesn't result in its
>>>>>> immediate re-assert, and then unmasks it later, after its threaded irq handler
>>>>>> completes.
>>>>>>
>>>>>> In handle_fasteoi_irq():
>>>>>>
>>>>>> if (desc->istate & IRQS_ONESHOT)
>>>>>>         mask_irq(desc);
>>>>>>
>>>>>> handle_irq_event(desc);
>>>>>>
>>>>>> cond_unmask_eoi_irq(desc, chip);
>>>>>>
>>>>>>
>>>>>> and later in unmask_threaded_irq():
>>>>>>
>>>>>> unmask_irq(desc);
>>>>>>
>>>>>> I also mentioned that in patch #3 description:
>>>>>> "Linux keeps such interrupt masked until its threaded handler finishes, to
>>>>>> prevent the EOI from re-asserting an unacknowledged interrupt.
>>>>> That makes sense. Can you include the full story in cover letter too?
>>>> Ok, I will.
>>>>
>>>>>
>>>>>> However, with KVM + vfio (or whatever is listening on the resamplefd) we don't
>>>>>> check that the interrupt is still masked in the guest at the moment of EOI.
>>>>>> Resamplefd is notified regardless, so vfio prematurely unmasks the host
>>>>>> physical IRQ, thus a new (unwanted) physical interrupt is generated in the host
>>>>>> and queued for injection to the guest."
>>> Sorry to barge in pretty late in the conversation (just been Cc'd on
>>> this), but why shouldn't the resamplefd be notified? If there has been
>> yeah sorry to get you involved here ;-)
>
> No problem!
>
>>> an EOI, a new level must be made visible to the guest interrupt
>>> controller, no matter what the state of the interrupt masking is.
>>>
>>> Whether this new level is actually *presented* to a vCPU is another
>>> matter entirely, and is arguably a problem for the interrupt
>>> controller emulation.
>>
>> FWIU on guest EOI the physical line is still asserted so the pIRQ is
>> immediately re-sampled by the interrupt controller (because the
>> resamplefd unmasked the physical IRQ) and recorded as a guest IRQ
>> (although it is masked at guest level). When the guest actually unmasks
>> the vIRQ we do not get a chance to re-evaluate the physical line level.
>
> Indeed, and maybe this is what should be fixed instead of moving the
> resampling point around (I was suggesting something along these lines
> in [1]).
>
> We already do this on arm64 for the timer, and it should be easy
> enough to generalise to any interrupt backed by the GIC (there is an
> in-kernel API to sample the pending state). No idea how that translates
> for other architectures though.

Actually I'm now thinking about changing the behavior implemented in my
patchset, which is:

1. If vEOI happens for a masked vIRQ, don't notify resamplefd, so
that no new physical IRQ is generated, and the vIRQ is not set as
pending.

2. After this vIRQ is unmasked by the guest, notify resamplefd.

to the following one:

1. If vEOI happens for a masked vIRQ, notify resamplefd as usual,
but also remember this vIRQ as, let's call it, "pending oneshot".

2. A new physical IRQ is immediately generated, so the vIRQ is
properly set as pending.

3. After the vIRQ is unmasked by the guest, check and find out that
it is not just pending but also "pending oneshot", so don't
deliver it to a vCPU. Instead, immediately notify resamplefd once
again.

In other words, don't try to avoid the extra physical interrupts in the
host (rather, use them to properly update the pending state of the
vIRQ), but do avoid propagating those extra interrupts to the guest.
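
To make that concrete, below is a rough standalone userspace model of
the bookkeeping I have in mind (names are made up for illustration,
this is not actual KVM code):

#include <stdbool.h>
#include <stdio.h>

struct virq {
        bool masked;            /* masked by the guest */
        bool pending;           /* asserted, waiting to be injected */
        bool pending_oneshot;   /* EOI'd while masked: swallow the next assert */
};

/* Stand-in for notifying the resamplefd, i.e. letting vfio unmask the
 * physical IRQ (which may immediately re-assert it, see step 2). */
static void notify_resamplefd(void)
{
        printf("resamplefd notified\n");
}

/* Step 1: the guest EOIs the vIRQ. */
static void virq_eoi(struct virq *v)
{
        if (v->masked)
                v->pending_oneshot = true;      /* remember it */
        notify_resamplefd();                    /* notify as usual */
}

/* Step 2: the host forwards the re-asserted physical IRQ. */
static void virq_assert(struct virq *v)
{
        v->pending = true;
}

/* Step 3: the guest unmasks the vIRQ. */
static void virq_unmask(struct virq *v)
{
        v->masked = false;
        if (v->pending && v->pending_oneshot) {
                /* Don't deliver this one to a vCPU; resample instead. */
                v->pending = false;
                v->pending_oneshot = false;
                notify_resamplefd();
        }
        /* A genuinely pending vIRQ would be delivered as usual here. */
}

int main(void)
{
        struct virq v = { .masked = true };     /* oneshot handler still running */

        virq_eoi(&v);           /* step 1 */
        virq_assert(&v);        /* step 2 */
        virq_unmask(&v);        /* step 3: swallowed, resampled again */
        printf("pending = %d\n", v.pending);    /* prints 0 */
        return 0;
}

So in terms of new state this would only add a per-vIRQ "pending
oneshot" bit.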

Does this sound reasonable to you?

Your suggestion to sample the pending state of the physical IRQ sounds
interesting too. But as you said, it's yet to be checked how feasible it
would be on architectures other than arm64. Also it assumes that the IRQ
in question is a forwarded physical interrupt, while I can imagine that
KVM's resamplefd could in principle also be useful for implementing
purely emulated interrupts.
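
That said, for the forwarded case I guess the in-kernel API you mean is
(or could be) irq_get_irqchip_state(). A minimal sketch of what the
re-evaluation could look like at the vIRQ unmask point, assuming the
resample path knows the host Linux IRQ number (the helper name is
hypothetical, and whether IRQCHIP_STATE_PENDING is actually reported
depends on the irqchip driver):

#include <linux/interrupt.h>    /* irq_get_irqchip_state() */

/*
 * When the guest unmasks the vIRQ, re-check whether the forwarded
 * physical line is still pending instead of unconditionally keeping
 * the vIRQ pending.
 */
static bool forwarded_irq_still_pending(unsigned int host_irq)
{
        bool pending = false;

        if (irq_get_irqchip_state(host_irq, IRQCHIP_STATE_PENDING, &pending))
                return false;   /* not reported by this irqchip */

        return pending;
}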

Do you see any advantages of sampling the physical IRQ pending state
over remembering the "pending oneshot" state as described above?

>
> M.
>
> [1] https://lore.kernel.org/r/87mtccbie4.wl-maz@xxxxxxxxxx
>