Re: pci error recovery procedure

From: Linas Vepstas
Date: Tue Sep 12 2006 - 15:39:20 EST


On Thu, Sep 07, 2006 at 11:18:56AM +0800, Zhang, Yanmin wrote:
> The error recovery procedures
> are meant to handle PCI hardware errors, not device driver bugs.

Over the last three years, we've uncovered (and fixed) dozens of
device driver bugs that were only detected because of the PCI error
detection hardware. The ability to get device dumps is important,
because many of these bugs are hard to reproduce and would otherwise
require attaching a PCI bus analyzer to the system.

> Current error handler infrastructure could support pci-e, but I want a better
> solution to help driver developers add error handlers more easily. My
> starting point is the driver developer. If they are not willing to add error handlers,
> it's impossible to do so for all drivers by you and me.

Right. As a result, we only care about the products that we actually
sell to customers. PCI error recovery is not some "gee, it's nice" piece
of eye-candy or chrome: either one is serious about high-availability,
or one is not.

--linas


