[PATCH] NMI trap revised (was Re: NMI errors in 2.0.30??)

Riccardo Facchetti (fizban@mbox.vol.it)
Wed, 7 May 1997 22:10:37 +0200 (MET DST)


Is this a good idea?

I was following the NMI discussion and today I found, under a pile of
old DDJ issues, an old manual which explains something (really not much)
about NMI on the old 80286 PC/XT. Assuming the NMI mechanism has not
changed for the (23456)86, this patch can help detect whether an NMI is
a real hardware failure or was triggered by something else.

On 6 May 1997, Matthias Urlichs wrote:
>
> The best memory checker is a kernel recompile, a network download, and a
> floppy read. At the same time. Repeatedly. I have yet to see a problem that
> didn't eventually show up under this kind of stress.

Hmmm, if you have paid for parity memory, I think you can at least try to
use NMI for something more than just sitting there and popping up when some
random bits of memory are wrong. With NMI you can try to detect which memory
chip is failing.
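
To make the idea concrete, here is a minimal user-space sketch of the
decoding step the patch below performs on the byte read from port 0x61.
The bit meanings (bit 7 = RAM parity check, bit 6 = I/O channel check) are
the ones the patch uses; the macro names and the helper nmi_describe() are
hypothetical and exist only for illustration, they are not kernel APIs.

```c
#include <stddef.h>
#include <stdint.h>

/* Bit masks for the status byte read from port 0x61, as tested in the patch. */
#define NMI_PARITY_ERR 0x80u /* bit 7: RAM parity check */
#define NMI_IOCHK_ERR  0x40u /* bit 6: I/O channel check */

/*
 * Hypothetical helper: stores up to two messages describing the status
 * byte in out[] and returns how many were stored.  A return value of 0
 * means neither error bit is set, i.e. an unexpected NMI.
 */
static size_t nmi_describe(uint8_t status, const char *out[2])
{
    size_t n = 0;

    if (status & NMI_PARITY_ERR)
        out[n++] = "RAM Parity Check: memory parity error.";
    if (status & NMI_IOCHK_ERR)
        out[n++] = "I/O Channel Check: I/O channel adapter error.";
    return n;
}
```

Both bits can be set at once, which is why the sketch tests them
independently instead of using an else branch.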

Here is the kernel bloat :)
If there is interest in this thing, I can finish writing the mem test part
of this code.
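
For anyone who wants to see the shape of that mem test part, below is a
rough simulation of the page-scan loop outlined in the patch comment
(steps 2 to 6). Nothing here touches real hardware: struct sim_machine,
sim_write_page() and scan_pages() are all hypothetical names, and the
parity latch is modelled by a plain integer instead of port 0x61. It is a
sketch under those assumptions, not the unfinished kernel code.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the machine; all hardware access is simulated. */
struct sim_machine {
    size_t num_pages;     /* number of 4Kb pages of flat memory */
    size_t kernel_first;  /* first page occupied by the kernel */
    size_t kernel_last;   /* last page occupied by the kernel */
    const uint8_t *bad;   /* bad[i] != 0: touching page i trips parity */
    int parity_flag;      /* models the parity error flag (bit 7 of 0x61) */
};

/* Simulated page write: a bad page sets the parity flag. */
static void sim_write_page(struct sim_machine *m, size_t page)
{
    if (m->bad[page])
        m->parity_flag = 1;
}

/*
 * Walk all pages except the kernel's own, recording each page that trips
 * the parity flag.  Returns the number of failing pages stored in found[].
 */
static size_t scan_pages(struct sim_machine *m, size_t *found, size_t max_found)
{
    size_t n = 0;

    for (size_t page = 0; page < m->num_pages; page++) {
        if (page >= m->kernel_first && page <= m->kernel_last)
            continue;               /* do not overwrite the kernel */
        m->parity_flag = 0;         /* reset the flag before each page */
        sim_write_page(m, page);    /* step 4: write the page */
        if (m->parity_flag && n < max_found)
            found[n++] = page;      /* step 5: this page is bogus */
    }
    return n;
}
```

The sketch resets the flag before every page instead of once up front, so
that more than one failing page can be reported. On real hardware, parity
is normally checked when memory is read, so an actual test would
presumably have to read each page back after writing it.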

Ciao,
Riccardo.

--- linux-2.1.36/arch/i386/kernel/traps.c Mon May 5 12:05:19 1997
+++ linux/arch/i386/kernel/traps.c Wed May 7 21:45:50 1997
@@ -2,6 +2,9 @@
* linux/arch/i386/traps.c
*
* Copyright (C) 1991, 1992 Linus Torvalds
+ *
+ * 1997-05-07 Modified do_nmi() by Riccardo Facchetti to try to display some
+ * useful information instead of the old generic message.
*/

/*
@@ -237,16 +240,83 @@

asmlinkage void do_nmi(struct pt_regs * regs, long error_code)
{
- printk("NMI\n"); show_registers(regs);
+#ifndef CONFIG_IGNORE_NMI
+ unsigned int nmi_info = 0;
+
+/*
+ * Before doing anything else, get the byte from port 0x61 (System Board
+ * I/O Port). It should contain information about what caused the NMI (at
+ * least this is what the old reference manual I have in front of me
+ * states).
+ */
+ nmi_info = inb_p(0x61);
+#endif
+
#ifdef CONFIG_SMP_NMI_INVAL
smp_flush_tlb_rcv();
#else
#ifndef CONFIG_IGNORE_NMI
- printk("Uhhuh. NMI received. Dazed and confused, but trying to continue\n");
- printk("You probably have a hardware problem with your RAM chips or a\n");
- printk("power saving mode enabled.\n");
+
+ printk("NMI received.\n");
+
+/*
+ * Test bits 7 and 6 for Memory Parity or I/O Channel error.
+ */
+ if (nmi_info & 0xC0) {
+/*
+ * This is a real error condition, sort out what error occurred.
+ */
+ if (nmi_info & 0x80) {
+ printk("RAM Parity Check: memory parity error.\n");
+
+ /*
+ * Maybe sort out which memory chip is failing?
+ * Heh ... with parity memory we can be a good memory
+ * test program too :)
+ * It should be something like:
+ *
+ * (1) disable NMI interrupts by writing 1 to bit 7 of
+ * port 0x70
+ * (2) reset the NMI memory parity error flag (bit 7)
+ * by toggling bit 2 of port 0x61 to 1 and then to 0
+ * (3) while all flat memory is tested:
+ * (4) write 4Kb page in memory
+ * (5) test if any NMI is pending: if yes, the
+ * last page written is bogus, printk its
+ * address.
+ * (6) ++ page
+ * (7) panic() out: nothing but the raw kernel is
+ * running on this machine now.
+ *
+ * In (4) we should take care not to overwrite the kernel
+ * because I suspect we need it at least for printk()
+ * and panic()
+ */
+ }
+
+ if (nmi_info & 0x40) {
+ printk("I/O Channel Check: I/O channel adapter error.\n");
+ /*
+ * I don't have a clue how to get more
+ * information about what is failing here.
+ */
+ }
+/*
+ * Maybe it is safer to panic here? We have an error on a memory chip or on the
+ * I/O channel adapter, so if we care about our data, we should stop everything now.
+ */
+ } else {
+/*
+ * Huh ?? unexpected NMI !!
+ */
+
+ printk("Uhhuh. Dazed and confused, but trying to continue.\n");
+ printk("You probably have a power saving mode enabled.\n");
+ }
#endif
#endif
+
+ show_registers(regs);
}

asmlinkage void do_debug(struct pt_regs * regs, long error_code)