Re: [RFC PATCH] irqchip/gic-v3: Do not enable irqs when handling spurious interrups

From: Marc Zyngier
Date: Fri Apr 23 2021 - 04:08:31 EST


On Fri, 23 Apr 2021 04:29:44 +0100,
He Ying <heying24@xxxxxxxxxx> wrote:

[...]

> >>>> I look into this issue and find that it's caused by 'BUG_ON(in_nmi())'
> >>>> in nmi_enter(). From the call trace, we find two 'el1_irqs' which
> >>>> means an interrupt preempts the other one and the new one is an NMI.
> >>>> Furthermore, by adding some prints, we find the first irq also calls
> >>>> nmi_enter(), but its priority is not GICD_INT_NMI_PRI and its irq number
> >>>> is 1023. It enables irq by calling gic_arch_enable_irqs() in
> >>>> gic_handle_irq(). At this moment, the second irq preempts the first irq
> >>>> and it's an NMI but current context is already in nmi. So that may be
> >>>> the problem.
> >>> I'm not sure I get it. From the stack trace, I see this:
> >>>
> >>> [ 14.816251] asm_nmi_enter+0x94/0x98
> >>> [ 14.816251] el1_irq+0x8c/0x180 (C)
> >>> [ 14.816252] gic_handle_irq+0xbc/0x2e4
> >>> [ 14.816252] el1_irq+0xcc/0x180 (B)
> >>> [ 14.816253] arch_timer_handler_virt+0x38/0x58
> >>> [ 14.816253] handle_percpu_devid_irq+0x90/0x240
> >>> [ 14.816253] generic_handle_irq+0x34/0x50
> >>> [ 14.816254] __handle_domain_irq+0x68/0xc0
> >>> [ 14.816254] gic_handle_irq+0xf8/0x2e4
> >>> [ 14.816255] el1_irq+0xcc/0x180 (A)
> >>>
> >>> which indicates that we preempted a timer interrupt (A) with another
> >>> IRQ (B), itself immediately preempted by another IRQ (C)? That's
> >>> indeed at least one too many.
> >>>
> >>> Can you please describe for each of (A), (B) and (C) whether they are
> >>> spurious or not, what their priorities are if they aren't spurious?
> >> Yes. I ignored interrupt (A). (B) is spurious and its priority is
> >> 0xa0 and PMR is 0x70. (C) is an NMI and its priority is 0x20. Note
> >> that GIC_PRIO_IRQON is 0xe0, GIC_PRIO_IRQOFF is 0x60,
> >> GICD_INT_DEF_PRI is 0xa0 and GICD_INT_NMI_PRI is 0x20 in our kernel.
> > If (B) is spurious (aka ICC_IAR1R_EL1 return 1023), then its
> > "priority" doesn't really exist, and I don't really get what you mean
> > by "its priority is 0xa0". ICC_RPR_EL1 shouldn't change when Ack-ing
> > a spurious interrupt, because there is no change in GIC state at all.
> OK. By saying "its priority is 0xa0", I just mean ICC_RPR_EL1 is read
> as 0xa0.

Right. That's very different from saying that an interrupt has
priority 0xA0.

> >
> > And if PMR is 0x70 at the point where you get (B), then I really can't
> > see how you can get an interrupt of priority 0xa0 anyway.
>
> Yes, it also confuses me. Perhaps ICC_RPR_EL1 changes when read, I
> think.

No, it just means that we were already in an interrupt context when we
handled the spurious interrupt: I think interrupt (A) above was being
handled with priority 0xa0, preempted by a spurious interrupt (B) which
did the wrong thing by messing with the NMI state.

Probably (B) was an NMI that got retired early and presented again to
the cpuif as (C). If that's the case, the GIC implementation may be a
bit flaky. But the driver definitely has a bug.

[...]

> > I believe the above patch would fix the spurious interrupt issue you
> > have experienced. Please let me know, and post a v2 if this works for
> > you.
>
> Fortunately, it works. I'll post a v2. Should I cc stable mail list?

Yes please.

Thanks,

M.

--
Without deviation from the norm, progress is not possible.