APIC error interrupt routine.

From: Rogier Wolff (R.E.Wolff@bitwizard.nl)
Date: Mon Aug 14 2000 - 05:27:43 EST


Hi Linus,

In the "APIC error interrupt" routine, some funny things happen: First
a spinlock is taken. Next some elaborate multiline prints are done,
reporting the "problem". Then the chip is "tickled" into being ready
for the next interrupt. Then the spinlock is unlocked.

If I'm not mistaken, the chip could interrupt us after being tickled
to be ready for the next interrupt. Now, this 'error interrupt'
normally never happens, so the chances of this happening AGAIN before
the printk's finish should be minimal, right? No.

In fact, the ABIT BP6 board is a bit shabby. It has a few problems.
Besides some versions having power supply issues, the APICs seem to
have transmission errors between them. When this happens, the APIC
ends up interrupting us a few times in quick succession, as probably
it's out of sync with its partner.

So, one of the things that I think are happening with the lockups on
BP6's is that when the spinlock is still taken, the apic interrupts us
again and that's one dead CPU spinning for a lock that it holds
itself. The other CPU probably will start spinning for this same
spinlock within a few seconds, so to most people this will look like a
complete lockup.

If I'm not mistaken, the spinlock should've been an "spinlock_irqsave"
in the first place, right?

Anyway, I think it's safest to remove the spinlock alltogether, as
long as we ack the APIC at the latest possible moment. I also think
that it's safe to just postpone the "time consuming" printk to just
after the ack to the APIC. Worst case, we'll see a quick succession of
APIC irq's and the printk's may be messed up. So?

What do you think about the attached patch?

Eric Andreychek reported that he managed to get a full week of uptime
with this patch installed when before he'd get lockups within a
day. (After reporting this his machine started locking up a few times
in an hour to prove him wrong... :-( ). Anyway, I think that this is just
one of the issues with the board, which makes it

Oh, typical output of the "new" function is:

Aug 10 20:43:03 Paradise-Lost kernel: APIC error: 00(02)
Aug 10 21:04:48 Paradise-Lost kernel: APIC error: 02(02)
Aug 10 21:14:20 Paradise-Lost kernel: APIC error: 00(08)
Aug 10 21:23:47 Paradise-Lost kernel: APIC error: 08(08)
Aug 10 21:34:24 Paradise-Lost kernel: APIC error: 08(08)
Aug 10 21:47:06 Paradise-Lost kernel: APIC error: 02(04)
Aug 10 22:10:23 Paradise-Lost kernel: APIC error: 04(02)
Aug 10 22:12:02 Paradise-Lost kernel: APIC error: 08(04)
Aug 10 22:12:02 Paradise-Lost kernel: APIC error: 02(08)
Aug 10 22:29:32 Paradise-Lost kernel: APIC error: 08(02)
Aug 10 22:37:38 Paradise-Lost kernel: APIC error: 04(04)
Aug 10 22:37:38 Paradise-Lost kernel: APIC error: 02(08)
Aug 10 23:08:39 Paradise-Lost kernel: APIC error: 08(02)

which means that indeed new bits seem to be coming on in the ESR
register during the course of this routine running. Shouldn't it be
much more common to see a "0" after the first try? Anyway please
correct me if my understanding of the APIC and this code is wrong...

                        Roger.

----------------------------------------------------------------------

--- linux-2.4.0-test6-pre1.clean/arch/i386/kernel/apic.c Sat Aug 12 10:42:52 2000
+++ linux-2.4.0-test6-pre1.fs50/arch/i386/kernel/apic.c Sat Aug 12 10:43:11 2000
@@ -682,46 +709,28 @@
  * This interrupt should never happen with our APIC/SMP architecture
  */
 
-static spinlock_t err_lock = SPIN_LOCK_UNLOCKED;
-
 asmlinkage void smp_error_interrupt(void)
 {
- unsigned long v;
-
- spin_lock(&err_lock);
+ unsigned long v, v1;
 
+ /* First tickle the hardware, only then report what went on. -- REW */
         v = apic_read(APIC_ESR);
- printk(KERN_INFO "APIC error interrupt on CPU#%d, should never happen.\n",
- smp_processor_id());
- printk(KERN_INFO "... APIC ESR0: %08lx\n", v);
-
         apic_write(APIC_ESR, 0);
- v |= apic_read(APIC_ESR);
- printk(KERN_INFO "... APIC ESR1: %08lx\n", v);
- /*
- * Be a bit more verbose. (multiple bits can be set)
- */
- if (v & 0x01)
- printk(KERN_INFO "... bit 0: APIC Send CS Error (hw problem).\n");
- if (v & 0x02)
- printk(KERN_INFO "... bit 1: APIC Receive CS Error (hw problem).\n");
- if (v & 0x04)
- printk(KERN_INFO "... bit 2: APIC Send Accept Error.\n");
- if (v & 0x08)
- printk(KERN_INFO "... bit 3: APIC Receive Accept Error.\n");
- if (v & 0x10)
- printk(KERN_INFO "... bit 4: Reserved!.\n");
- if (v & 0x20)
- printk(KERN_INFO "... bit 5: Send Illegal Vector (kernel bug).\n");
- if (v & 0x40)
- printk(KERN_INFO "... bit 6: Received Illegal Vector.\n");
- if (v & 0x80)
- printk(KERN_INFO "... bit 7: Illegal Register Address.\n");
-
+ v1 = apic_read(APIC_ESR);
         ack_APIC_irq();
-
         irq_err_count++;
 
- spin_unlock(&err_lock);
+ /* Here is what the APIC error bits mean:
+ 0: Send CS error
+ 1: Receive CS error
+ 2: Send accept error
+ 3: Receive accept error
+ 4: Reserved
+ 5: Send illegal vector
+ 6: Received illegal vector
+ 7: Illegal register address
+ */
+ printk (KERN_ERR "APIC error on CPU%d: %02x(%02x)\n",
+ smp_processor_id(), v , v1);
 }
 

-- 
** R.E.Wolff@BitWizard.nl ** http://www.BitWizard.nl/ ** +31-15-2137555 **
*-- BitWizard writes Linux device drivers for any device you may have! --*
*       Common sense is the collection of                                *
******  prejudices acquired by age eighteen.   -- Albert Einstein ********

- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.rutgers.edu Please read the FAQ at http://www.tux.org/lkml/



This archive was generated by hypermail 2b29 : Tue Aug 15 2000 - 21:00:33 EST