Re: [PATCH v1 3/4] PCI: brcmstb: Add panic/die handler to RC driver

From: Bjorn Helgaas
Date: Tue May 25 2021 - 17:17:21 EST


On Tue, May 25, 2021 at 05:05:51PM -0400, Jim Quinlan wrote:
> On Tue, May 25, 2021 at 4:40 PM Bjorn Helgaas <helgaas@xxxxxxxxxx> wrote:
> > On Tue, Apr 27, 2021 at 01:51:38PM -0400, Jim Quinlan wrote:
> > > Whereas most PCIe HW returns 0xffffffff on illegal accesses and the like,
> > > by default Broadcom's STB PCIe controller effects an abort. This simple
> > > handler determines if the PCIe controller was the cause of the abort and if
> > > so, prints out diagnostic info.
> > >
> > > Example output:
> > > brcm-pcie 8b20000.pcie: Error: Mem Acc: 32bit, Read, @0x38000000
> > > brcm-pcie 8b20000.pcie: Type: TO=0 Abt=0 UnspReq=1 AccDsble=0 BadAddr=0
> >
> > What happens to the driver that performed the illegal access?
>
> The entire system dies from the abort. Some customers elect to do a
> fixup in the abort handler but we admonish them to fix the root cause.
> With these patches we at least get immediate information about the
> access that caused the abort.
> >
> > Does this mean that errors that are recoverable on other hardware (by
> > noticing the 0xffffffff and checking for error) are fatal on the
> > Broadcom STB?
>
> Yes. For example, I have an old Rocketport RP2 card I sometimes use
> for testing. On a Broadcom STB it dies when the rp2 probe does a
> read after calling rp2_reset_asic(). On an x86, 0xffffffff is
> returned on this read and all is well.
>
> I don't think there is any PCIe spec that mandates an access error
> returns 0xffffffff. Some of our SOCs have a new feature where we can
> return the 0xffffffff instead of getting an abort. We will allow the
> customer to turn this on if they ask for it, but for the time being we
> prefer an abort as many drivers do not check for 0xffffffff.

Right, the mechanism of error reporting is an implementation choice.
Few drivers are actually prepared to deal with 0xffffffff data, but in
many systems, especially those with removable PCI devices, it is
important to be able to report errors in a way that doesn't crash the
system.
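
For illustration only, here is a minimal user-space sketch of what
"prepared to deal with 0xffffffff data" could look like in a driver.
Everything in it (struct fake_dev, fake_readl(), read_status(),
ALL_ONES) is a hypothetical name made up for this example, not
something from the brcmstb patch or from any kernel API; the simulated
read just stands in for an MMIO access to a device that may have gone
away.

```c
#include <stdint.h>

/* All-ones data, as returned for failed reads on many platforms. */
#define ALL_ONES 0xffffffffu

/* Hypothetical stand-in for a device; not a real kernel structure. */
struct fake_dev {
	int present;      /* 0 simulates a removed or unresponsive device */
	uint32_t status;  /* contents of the simulated status register */
};

/* Simulated MMIO read: a missing device reads back as all ones. */
static uint32_t fake_readl(const struct fake_dev *dev)
{
	return dev->present ? dev->status : ALL_ONES;
}

/*
 * Return 0 and store the register value on success, or -1 if the read
 * looks like an error response.  This assumes all ones is never valid
 * data for this register; a real driver would have to know that about
 * its own hardware before relying on such a check.
 */
static int read_status(const struct fake_dev *dev, uint32_t *val)
{
	uint32_t v = fake_readl(dev);

	if (v == ALL_ONES)
		return -1;

	*val = v;
	return 0;
}
```

A caller would check the return value of read_status() and bail out of
probe or mark the device dead instead of acting on the bogus data.  On
hardware that aborts on the failed read, this check is never reached.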

That may not be a concern in your system, so maybe just mention in the
commit log that this is a fatal error for the system.

Bjorn