Enhanced PCI Error Recovery

From: linas
Date: Wed Mar 24 2004 - 17:00:45 EST



Hi,

I'm trying to implement some enhanced PCI error recovery features
for some of the PCI chipsets on IBM pSeries (powerpc64) boxes.
I'm lead to beleive that other chipsets (in particular PCI Express
chipsets) might have similar features, and thus I'd like to find
out who else is working on something like this & could share in
the discussion/design/implementation.

The IBM pSeries boxes have something called "EEH" (Enhanced Error
Handling) that deals with the kinds of errors that the older PCI
spec explicitly ignored (except to check-stop the cpu, which would
be a bad thing to do on a big server). EEH handled errors that
couldn't otherwise be "reported", such as parity errors on posted
writes, or io adapters that assert SERR (system error) back to
the host bridge. It handled these errors by (synchronously)
cutting off all i/o to the adapter (writes are dropped on the
floor, reads return all foxes). It provides a way of 'recovering'
from these, in a rather blunt fashion: for all practical purposes,
its equivalent to toggling the power for that PCI slot, and
restarting the device driver.

I've been told that PCI Express provides roughly similar function
(although that's far from obvious) and that there are people working
on implementing it for the linux kernel. I'm reading the PCI Express
spec; it talks about "fatal errors": these are handled by interrupting
the cpu, and putting the error status in various registers. The spec
does not seem to talk about how the software can recover from
the "unrecoverable" errors, nor does it seem to mention what happens
if a device driver continues to try to access a device after an
unrecoverable error occurred.

I was hoping to write the pSeries code so that it mostly sat in
the arch/ppc64 directory... but if there are really other efforts
then maybe we should talk about sharing code/design.

(My goal is to reset the device, be it a scsi card or an ethernet card,
re-init the device driver, and go on, ideally without letting the
block device or sockets layer up above to ever find out there was
any problem. So that e.g. an unrecoverable error under your root
partition doesn't have to end in a kernel panic).

Due to the high traffic on LKML, and some superficial similarity
between this and pci-hotplug, I recommend continuing discussion on

linux-hotplug-devel@xxxxxxxxxxxxxxxxxxxxx
http://linux-hotplug.sourceforge.net

--linas
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/