Re: Unhandled IRQs on AMD E-450

From: Jeroen Van den Keybus
Date: Sun Dec 11 2011 - 10:28:46 EST


> So the IRQ _does_ get unstuck eventually; I didn't expect that.

It would nevertheless make sense that the designer of the I/O-APIC
would have implemented a reasonable timeout for INTx-Deassert
reception. Perhaps that's what we see.

> So either the ASM1083 delays its Deassert messages, or it is just way
> too slow to react to changes in its PCI interrupt line inputs.

I'm afraid the only sensible way to find out would be to
somehow monitor the PCIe link traffic from the ASM1083 into the FCH.
Maybe someone from AMD knows whether this can be done? Let's not forget
that the board seems to run fine under Windows 7, so maybe
Linux simply doesn't perform some special trick with the bridge or the
chipset that Windows does. So, without further evidence, I would not (yet)
blame the bridge.

> I'd guess that you can make the polling time shorter; a few milliseconds
> should be enough.

I have tested the patch for a while now. I did indeed decrease the polling
interval to 10 ms (100 Hz), and the IRQ is re-enabled after 1
second (100 polling cycles). It works well enough that the computer
actually becomes usable. Under heavy use, the patch kicks in up to 10
times a minute; otherwise it is required only a few times per hour. I
also turn off polling entirely when it is no longer needed.
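
To make the timing concrete, here is a minimal userspace C model of the
poll-then-re-enable cycle described above. It is a sketch, not kernel
code: the model_* names and the tick rate of 1000 Hz are assumptions
made for illustration, and the real poller calls try_one_irq() where
the model only counts.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_HZ            1000              /* assumed tick rate (hypothetical) */
#define MODEL_POLL_INTERVAL (MODEL_HZ / 100)  /* 10 ticks = 10 ms at 1000 Hz */
#define MODEL_POLL_CYCLES   100               /* 100 polls * 10 ms = 1 second */

/* Hypothetical stand-in for the relevant parts of an IRQ descriptor. */
struct model_irq {
    bool spurious_disabled;  /* models the IRQS_SPURIOUS_DISABLED flag */
    unsigned int polls;      /* models irqs_unhandled reused as a poll counter */
};

/*
 * One timer callback for a single IRQ line.
 * Returns true if the poll timer should be re-armed, false if polling
 * can stop (line not disabled, or line just re-enabled).
 */
static bool model_poll_once(struct model_irq *irq)
{
    assert(irq != NULL);

    if (!irq->spurious_disabled)
        return false;               /* nothing to poll */

    if (irq->polls < MODEL_POLL_CYCLES) {
        irq->polls++;               /* the kernel would poll the handlers here */
        return true;                /* keep the timer running */
    }

    irq->spurious_disabled = false; /* re-enable the line */
    irq->polls = 0;
    return false;                   /* no candidates left: timer stays off */
}

/* Run the poller until it stops re-arming; return the elapsed model ticks. */
static unsigned int model_ticks_until_reenabled(struct model_irq *irq)
{
    unsigned int ticks = 0;

    while (model_poll_once(irq))
        ticks += MODEL_POLL_INTERVAL;
    return ticks;
}
```

Starting from a disabled line, model_ticks_until_reenabled() counts 100
re-arms of 10 ticks each, i.e. one second at the assumed tick rate, after
which the line is enabled again and the timer is no longer re-armed.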

Specifically for the Asus E45M1-M PRO I would recommend:

1. The IRQ bug manifests itself when using any device behind the
ASM1083 bridge. That includes the 2 PCI slots on the motherboard, as
well as the Firewire interface. Avoid their use. Preferably use the
PCIe x1 slot.

2. An important problem is that, when one of IRQs 16..19 gets disabled,
any integrated device on that line, which otherwise works flawlessly,
goes down with it. This includes the SATA, USB and both audio
(HDMI / analog) subsystems. If possible, enable the use of MSI for
these devices. Clemens's patch for AHCI MSI is a real help here.

3. Step 1 above will practically eliminate the occurrence of the IRQ
bug. If the PCI bus really is needed, the patch below must be used
(with the kernel irqpoll command line option turned on, of course).

> Your patch might be useful to others afflicted with this chip.  Could
> you publish it?

No problem, but I've never done this before. Is the diff output
below OK? Could someone with the relevant expertise also have a look at
the thread safety?


J.


(Beginning of patch for kernel/irq/spurious.c)

21c21
< #define POLL_SPURIOUS_IRQ_INTERVAL (HZ/10)
---
> #define POLL_SPURIOUS_IRQ_INTERVAL (HZ/100)
144c144
< int i;
---
> int i, poll_again;
149a150
> poll_again = 0; /* Will stay false as long as no polling candidate is found */
151c152
< unsigned int state;
---
> unsigned int state, irq;
161,164c162,182
<
< local_irq_disable();
< try_one_irq(i, desc, true);
< local_irq_enable();
---
>
> /* We end up here with a disabled spurious interrupt.
> desc->irqs_unhandled now tracks the number of times
> the interrupt has been polled */
>
> irq = desc->irq_data.irq;
> if (desc->irqs_unhandled < 100) { /* 1 second delay with poll frequency 100 Hz */
> if (desc->irqs_unhandled == 0)
> printk("Polling IRQ %d\n", irq);
> local_irq_disable();
> try_one_irq(i, desc, true);
> local_irq_enable();
> desc->irqs_unhandled++;
> poll_again = 1;
> } else {
> printk("Reenabling IRQ %d\n", irq);
> irq_enable(desc); /* Reenable the interrupt line */
> desc->depth--;
> desc->istate &= (~IRQS_SPURIOUS_DISABLED);
> desc->irqs_unhandled = 0;
> }
165a184,186
> if (poll_again)
> mod_timer(&poll_spurious_irq_timer,
> jiffies + POLL_SPURIOUS_IRQ_INTERVAL);
168,169d188
< mod_timer(&poll_spurious_irq_timer,
< jiffies + POLL_SPURIOUS_IRQ_INTERVAL);
180c199
< * If 99,900 of the previous 100,000 interrupts have not been handled
---
> * If 9 of the previous 10 interrupts have not been handled
184c203,211
< * (The other 100-of-100,000 interrupts may have been a correctly
---
> * Although this may cause early deactivation of a sporadically
> * malfunctioning IRQ line, the poll system will:
> * a) Poll it for 100 cycles at a 100 Hz rate
> * b) Reenable it afterwards
> *
> * In worst case, with current settings, this will cause short bursts
> * of 10 interrupts every second.
> *
> * (The other single interrupt may have been a correctly
305c332
< if (likely(desc->irq_count < 100000))
---
> if (likely(desc->irq_count < 10))
309c336
< if (unlikely(desc->irqs_unhandled > 99900)) {
---
> if (unlikely(desc->irqs_unhandled >= 9)) {
313c340
< __report_bad_irq(irq, desc, action_ret);
---
> /* __report_bad_irq(irq, desc, action_ret); */
317c344
< printk(KERN_EMERG "Disabling IRQ #%d\n", irq);
---
> printk(KERN_EMERG "Disabling IRQ %d\n", irq);

(End of patch)
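
For readers who want to see the effect of the changed thresholds, the
following is a simplified userspace C model of the detection window as
the diff appears to change it: the counters are evaluated every 10
interrupts instead of every 100,000, and the line is disabled when at
least 9 of those went unhandled. The model_* names are hypothetical, and
the model leaves out everything else the real function does, so treat it
as a sketch under those assumptions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_WINDOW    10  /* patched window size (originally 100000) */
#define MODEL_BAD_LIMIT  9  /* patched limit, ">= 9" (originally "> 99900") */

/* Hypothetical stand-in for the per-IRQ counters used by the detector. */
struct model_counters {
    unsigned int irq_count;       /* interrupts seen in the current window */
    unsigned int irqs_unhandled;  /* of those, how many nobody handled */
    bool disabled;                /* set once the line is declared bad */
};

/* Account for one interrupt and evaluate the window when it is full. */
static void model_note_interrupt(struct model_counters *c, bool handled)
{
    assert(c != NULL);

    if (!handled)
        c->irqs_unhandled++;

    c->irq_count++;
    if (c->irq_count < MODEL_WINDOW)
        return;                   /* window not full yet */

    c->irq_count = 0;
    if (c->irqs_unhandled >= MODEL_BAD_LIMIT)
        c->disabled = true;       /* the kernel would disable the line and start polling */
    c->irqs_unhandled = 0;        /* start a fresh window */
}
```

With these settings, nine unhandled interrupts followed by one handled
interrupt disable the line, while eight unhandled and two handled do not.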