Re: [PATCH v1 1/1] PCI/ERR: Handle fatal error recovery for non-hotplug capable devices

From: Kuppuswamy, Sathyanarayanan
Date: Tue May 26 2020 - 23:06:07 EST


Hi,

On 5/26/20 8:00 PM, Oliver O'Halloran wrote:
On Wed, May 27, 2020 at 12:00 PM Kuppuswamy, Sathyanarayanan
<sathyanarayanan.kuppuswamy@xxxxxxxxxxxxxxx> wrote:

Hi,

On 5/21/20 7:56 PM, Yicong Yang wrote:


On 2020/5/22 3:31, Kuppuswamy, Sathyanarayanan wrote:

Not exactly. In pci_bus_error_reset(), we call pci_slot_reset() only if it's
hotpluggable. But we always call pci_bus_reset() to perform a secondary bus
reset for the bridge. That's what I think is unnecessary for a normal link,
and that's what reset link indicates us to do. The slot reset is introduced
in the process only to solve side effects. (c4eed62a2143, PCI/ERR: Use slot reset if available)

IIUC, pci_bus_reset() will do slot reset if its supported (hot-plug
capable slots). If its not supported then it will attempt secondary
bus reset. So secondary bus reset will be attempted only if slot
reset is not supported.

Since reported_error_detected() requests us to do reset, we will have
to attempt some kind of reset before we call ->slot_reset() right?

Yes, the driver returns PCI_ERS_RESULT_NEED_RESET from
->error_detected() to indicate that it doesn't know how to recover
from the error. How that reset is performed doesn't really matter, but
it does need to happen.


PCI_ERS_RESULT_NEED_RESET indicates that the driver
wants a platform-dependent slot reset and its ->slot_reset() method to be called then.
I don't think it's same as slot reset mentioned above, which is only for hotpluggable
ones.
What you think is the correct reset implementation ? Is it something
like this?

if (hotplug capable)
try_slot_reset()
else
do_nothing()

Looks broken to me, but all the reset handling is a rat's nest so
maybe I'm missing something. In the case of a DPC trip the link is
disabled which has the side-effect of hot-resetting the downstream
device. Maybe it's fine?
Yes, in case of DPC (Fatal errors) link is already reset. So we
don't need any special handling. This reset logic is mainly for
non-fatal errors.

As an aside, why do we have both ->slot_reset() and ->reset_done() in
the error handling callbacks? Seems like their roles are almost
identical.
Not sure.I think reset_done() is final cleanup.

Oliver