Re: MSI interrupt for xhci still lost on 5.6-rc6 after cpu hotplug

From: Raj, Ashok
Date: Thu May 07 2020 - 08:18:54 EST


Hi Thomas

We did a bit more tracing and it looks like the IRR check is actually
not happening on the right cpu. See below.

On Tue, May 05, 2020 at 11:47:26PM +0200, Thomas Gleixner wrote:
> >
> > msi_set_affinit ()
> > {
> > ....
> > unlock_vector_lock();
> >
> > /*
> > * Check whether the transition raced with a device interrupt and
> > * is pending in the local APICs IRR. It is safe to do this outside
> > * of vector lock as the irq_desc::lock of this interrupt is still
> > * held and interrupts are disabled: The check is not accessing the
> > * underlying vector store. It's just checking the local APIC's
> > * IRR.
> > */
> > if (lapic_vector_set_in_irr(cfg->vector))
> > irq_data_get_irq_chip(irqd)->irq_retrigger(irqd);
>
> No. This catches the transitional interrupt to the new vector on the
> original CPU, i.e. the one which is running that code.

Mathias added some trace to his xhci driver when the isr is called.

Below is the tail of my trace with last two times xhci_irq isr is called:

<idle>-0 [003] d.h. 200.277971: xhci_irq: xhci irq
<idle>-0 [003] d.h. 200.278052: xhci_irq: xhci irq

Just trying to follow your steps below with traces. The traces follow
the same comments in the source.

>
> Again the steps are:
>
> 1) Allocate new vector on new CPU

/* Allocate a new target vector */
ret = parent->chip->irq_set_affinity(parent, mask, force);

migration/3-24 [003] d..1 200.283012: msi_set_affinity: msi_set_affinity: quirk: 1: new vector allocated, new cpu = 0

>
> 2) Set new vector on original CPU

/* Redirect it to the new vector on the local CPU temporarily */
old_cfg.vector = cfg->vector;
irq_msi_update_msg(irqd, &old_cfg);

migration/3-24 [003] d..1 200.283033: msi_set_affinity: msi_set_affinity: Redirect to new vector 33 on old cpu 6
>
> 3) Set new vector on new CPU

/* Now transition it to the target CPU */
irq_msi_update_msg(irqd, cfg);

migration/3-24 [003] d..1 200.283044: msi_set_affinity: msi_set_affinity: Transition to new target cpu 0 vector 33



if (lapic_vector_set_in_irr(cfg->vector))
irq_data_get_irq_chip(irqd)->irq_retrigger(irqd);


migration/3-24 [003] d..1 200.283046: msi_set_affinity: msi_set_affinity: Update Done [IRR 0]: irq 123 localsw: Nvec 33 Napic 0

>
> So we have 3 points where an interrupt can fire:
>
> A) Before #2
>
> B) After #2 and before #3
>
> C) After #3
>
> #A is hitting the old vector which is still valid on the old CPU and
> will be handled once interrupts are enabled with the correct irq
> descriptor - Normal operation (same as with maskable MSI)
>
> #B This must be checked in the IRR because the there is no valid vector
> on the old CPU.

The check for IRR seems like on a random cpu3 vs checking for the new vector 33
on old cpu 6?

This is the place when we force the retrigger without the IRR check things seem to fix itself.

>
> #C is handled on the new vector on the new CPU
>


Did we miss something?

Cheers,
Ashok