[RFC] How drivers notice a HW error?

From: Hidetoshi Seto
Date: Thu Nov 27 2003 - 03:31:24 EST


Hi all,

This is a request for comments, especially comments from driver developers.

On some platform, for example IA64, the chipset detects an error caused by
driver's operation such as I/O read, and reports it to kernel. Linux kernel
analyzes the error and decides to kill the driver or reboot at worst.
I want to convey the error information to the offending driver, and want to
enable the driver to recover the failed operation.

So, just a plan, I think about a readb_check function that has checking ability
enable it to return error value if error is occurred on read. Drivers could use
readb_check instead of usual readb, and could diagnosis whether a retry be
required or not, by the return value of readb_check.

To realize this, I consider following two images:

+ readb_check on driver (with Notifier)
[Outline]:
- Hardware error handler (for example in IA64, MCA handler) has a Notifier
as hook point.
- Driver may register a hook function to the Notifier.
- Notifier calls over registered functions when error is occurred.
- Called hook function checks address of error, and if the error seems
to be concerned with the parent driver, ups internal error flag and
stops Notifier by returning OK.
- Hardware error handler regards state of Notifier, and decides the system
to resume or not.
- Restarted driver may refer the error flag after read, and may retry the
read if flag is up.
[Issue]:
- Some interfaces such as register hooks would be required.
- Coding a hook function would be a bother of developers.

+ readb_check on kernel
[Outline]:
- Kernel has readb_check function.
- Drivers may use readb_check instead of usual readb.
- Hardware error handler checks address of error, and if it occurs in
readb_check, changes return value of readb_check and resumes
interrupted context.
- Driver may refer the return value to notice an error in last read
procedure.
[Issue]:
- Overhead would be involved. (Possibly, it could say negligible since
I/O reads are already horribly slow.)

IMO, this is a general-purpose function that should be available on many
platforms. I also hear that Solaris has some similar implementations like this.

If you have any comment about this feature or any idea different from this,
please tell me.


Best regards,

------

H.Seto <seto.hidetoshi@xxxxxxxxxxxxxx>

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/