Re: [Announce] Linux Test Project

From: Andi Kleen (ak@suse.de)
Date: Thu Aug 10 2000 - 17:09:14 EST


On Thu, Aug 10, 2000 at 02:12:32PM -0500, Nathan Straz wrote:
> On Thu, Aug 10, 2000 at 11:35:03AM -0400, David Mansfield wrote:
> > One question: how is the framework going to handle tests which cause
> > pathological behavior in the kernel. For example, 'infinite' hangs in
> > the MM system during OOM, or crashes (OOPSes, panics) or deadlocks
> > (process stuck in 'D' state). Most often these are the results of the
> > tests I tend to run.
>
> That's one of the most imporant things we need to discuss. We
> definately need to build into the framework a way to recover from
> problems like this. If this will be some type of automated reboot, or
> someone walking in a rebooting the machine manually, I don't know. My
> goals are to get the system as automated as possible. It may turn out
> that we will include these tests as manual tests for completeness. We
> would like to discuss any ideas people have.

Linux is already shipping with a software watchdog. When enabled it wants
a regular write from a user mode daemon to a special character device.
When the write doesn't happen after some time it'll reboot.

The only problem is when it was crashing with interrupts off. At least on
machines with a global APIC (=usually SMP) there is a NMI watchdog in 2.4
that triggers an oops after some time of interruptless spinning.
It only works on SMP build.

I would suggest to recommend configuring the software watchdog before
running any critical tests. The test procedure could also use a simple
log mechanism (write a START TEST record to a log file, fsync it) and
a restart mechanism that tries to figure out any crashes so they can
be logged.

-Andi

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.rutgers.edu
Please read the FAQ at http://www.tux.org/lkml/



This archive was generated by hypermail 2b29 : Tue Aug 15 2000 - 21:00:22 EST