From: Chris Friesen
Date: Thu Jun 10 2010 - 14:30:16 EST

On 06/10/2010 11:29 AM, Brian Gordon wrote:

> When these SEU can be detected some action may be taken to improve
> the behaviour of the system (log a fault and reset in order to
> refresh things from scratch?). So the first question becomes how to
> detect an SEU.

I work on telco equipment. We use ECC RAM, turn on ECC/parity on the
various buses, enable error checking in the hardware, etc.

At higher abstraction levels you can checksum the data being stored and
validate it when you access it.
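
As a rough illustration of checksumming at the application level, here
is a minimal Python sketch. The ChecksummedStore class and the
CorruptionError exception are hypothetical names made up for this
example; only zlib.crc32 is a real library call. It assumes a CRC32 is
strong enough for your data, which may not hold for large blocks.

```python
import zlib


class CorruptionError(Exception):
    """Raised when stored data no longer matches its checksum."""


class ChecksummedStore:
    """Hypothetical in-memory store that keeps a CRC32 next to each value."""

    def __init__(self):
        self._items = {}

    def put(self, key, data):
        # Keep a private mutable copy plus the checksum taken at write time.
        buf = bytearray(data)
        self._items[key] = (buf, zlib.crc32(buf))

    def get(self, key):
        buf, stored_crc = self._items[key]
        # Recompute on every read; a mismatch means the bytes changed
        # after they were stored.
        if zlib.crc32(buf) != stored_crc:
            raise CorruptionError("checksum mismatch for key %r" % (key,))
        return bytes(buf)
```

This only detects corruption; what to do about it (refetch, log a
fault, reset) is a separate policy decision.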

Some of the errors are "soft" and can be corrected; others are "hard"
and uncorrectable. If you get enough "soft" errors in a short enough
time, it may be desirable to treat that as a "hard" error and reset.
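
One way to express the "enough soft errors in a short enough time"
rule is a sliding time window. The sketch below is only an
illustration: SoftErrorMonitor, max_errors and window_seconds are
hypothetical names, and the right threshold depends entirely on your
hardware and your tolerance for resets.

```python
from collections import deque


class SoftErrorMonitor:
    """Hypothetical tracker that escalates a burst of soft errors."""

    def __init__(self, max_errors, window_seconds):
        self.max_errors = max_errors
        self.window_seconds = window_seconds
        self._times = deque()

    def record(self, now):
        """Record a corrected error at time `now` (in seconds).

        Returns True if the number of errors inside the window now
        exceeds max_errors, i.e. the burst should be treated as hard.
        """
        self._times.append(now)
        # Drop errors that have aged out of the window.
        while self._times and now - self._times[0] > self.window_seconds:
            self._times.popleft()
        return len(self._times) > self.max_errors
```

The caller would pass something like time.monotonic() as `now` and
trigger the reset path when record() returns True.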

> Thank you to anyone for any pointers on where I can look to learn
> more about detecting SEU in linux.

You might start by taking a look at the "edac" code in the kernel.
Linux in general doesn't normally enable all of the fault detection
the hardware offers, so you may need to start looking at datasheets.
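
If an EDAC driver is loaded for your memory controller, the error
counters show up in sysfs. A minimal Python sketch for reading them
follows. It assumes the usual layout of
/sys/devices/system/edac/mc/mcN/ce_count and ue_count (correctable and
uncorrectable counts); read_edac_counts is a hypothetical helper name,
and on a machine without a supported controller it just returns an
empty dict.

```python
from pathlib import Path


def read_edac_counts(root="/sys/devices/system/edac/mc"):
    """Return {"mc0": {"ce": N, "ue": M}, ...} from the EDAC sysfs tree."""
    counts = {}
    for mc in sorted(Path(root).glob("mc[0-9]*")):
        try:
            ce = int((mc / "ce_count").read_text())
            ue = int((mc / "ue_count").read_text())
        except (OSError, ValueError):
            # Skip controllers whose counters are missing or unreadable.
            continue
        counts[mc.name] = {"ce": ce, "ue": ue}
    return counts


if __name__ == "__main__":
    for name, c in read_edac_counts().items():
        print(name, "correctable:", c["ce"], "uncorrectable:", c["ue"])
```

Polling these counters periodically gives you the raw events; a
rising ce_count is the "soft" case and any ue_count is the "hard" one.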


