Re: [RFC PATCH 5/9] PCI/AER: Apply function level reset to RCiEP on fatal error

From: Sean V Kelley
Date: Tue Jul 28 2020 - 12:14:15 EST


On 28 Jul 2020, at 6:27, Zhuo, Qiuxu wrote:

From: Jonathan Cameron <Jonathan.Cameron@xxxxxxxxxx>
Sent: Monday, July 27, 2020 7:17 PM
To: Kelley, Sean V <sean.v.kelley@xxxxxxxxx>
Cc: bhelgaas@xxxxxxxxxx; rjw@xxxxxxxxxxxxx; ashok.raj@xxxxxxxxxx; Luck,
Tony <tony.luck@xxxxxxxxx>;
sathyanarayanan.kuppuswamy@xxxxxxxxxxxxxxx; linux-pci@xxxxxxxxxxxxxxx;
linux-kernel@xxxxxxxxxxxxxxx; Zhuo, Qiuxu <qiuxu.zhuo@xxxxxxxxx>
Subject: Re: [RFC PATCH 5/9] PCI/AER: Apply function level reset to RCiEP
on fatal error

On Fri, 24 Jul 2020 10:22:19 -0700
Sean V Kelley <sean.v.kelley@xxxxxxxxx> wrote:

From: Qiuxu Zhuo <qiuxu.zhuo@xxxxxxxxx>

Attempt to do function level reset for an RCiEP associated with an
RCEC device on fatal error.

I'd like to understand more on your reasoning for flr here.
Is it simply that it is all we can do, or is there some basis in a spec
somewhere?


Yes. Although an RCiEP has no link to reset, I think we should still be able to reset it via FLR on a fatal error, provided the RCiEP supports FLR.

-Qiuxu


Also see PCIe 5.0-1, Sec. 6.6.2 Function Level Reset (FLR)

Implementation of FLR is optional, but strongly recommended. As an example use case, consider CXL: the Function 0 DVSEC instances control the CXL functionality of the entire CXL device. FLR may succeed in recovering from CXL.io domain errors.
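
Because FLR is optional, recovery code has to check for it before relying on it. The following is a minimal user-space sketch of that decision, not kernel code: `struct model_dev`, `model_has_flr()`, `model_flr_on_rciep()` and the `ERS_RESULT_*` names are hypothetical stand-ins for the kernel types and helpers used in the patch. The only hardware fact assumed is that the Function Level Reset Capability flag is bit 28 of the PCIe Device Capabilities register. As far as I know the kernel's own check also honors a per-device "no FLR" quirk flag, which is omitted here.

```c
#include <stdbool.h>
#include <stdint.h>

/* Device Capabilities register, bit 28: Function Level Reset Capability. */
#define DEVCAP_FLR 0x10000000u

/* Simplified stand-ins for the three recovery results used in the patch. */
enum ers_result {
	ERS_RESULT_NONE,        /* nothing could be done (no FLR support) */
	ERS_RESULT_DISCONNECT,  /* reset was attempted and failed */
	ERS_RESULT_RECOVERED,   /* reset succeeded */
};

/* Hypothetical model of a function; not a kernel structure. */
struct model_dev {
	uint32_t devcap;     /* snapshot of the Device Capabilities register */
	int flr_status;      /* what a reset attempt would return, 0 = success */
};

/* True if the function advertises FLR in Device Capabilities. */
static bool model_has_flr(const struct model_dev *dev)
{
	return (dev->devcap & DEVCAP_FLR) != 0;
}

/* Mirrors the three-way outcome of the helper added by the patch. */
static enum ers_result model_flr_on_rciep(const struct model_dev *dev)
{
	if (!model_has_flr(dev))
		return ERS_RESULT_NONE;

	if (dev->flr_status != 0)
		return ERS_RESULT_DISCONNECT;

	return ERS_RESULT_RECOVERED;
}
```

In this model only the last outcome lets recovery continue. A device without FLR and a device whose FLR fails both end up on the failure path, just with different result codes.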

Thanks,

Sean


Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@xxxxxxxxx>
---
drivers/pci/pcie/err.c | 31 ++++++++++++++++++++++---------
1 file changed, 22 insertions(+), 9 deletions(-)

diff --git a/drivers/pci/pcie/err.c b/drivers/pci/pcie/err.c
index 044df004f20b..9b3ec94bdf1d 100644
--- a/drivers/pci/pcie/err.c
+++ b/drivers/pci/pcie/err.c
@@ -170,6 +170,17 @@ static void pci_walk_dev_affected(struct pci_dev *dev, int (*cb)(struct pci_dev
 	}
 }

+static pci_ers_result_t flr_on_rciep(struct pci_dev *dev)
+{
+	if (!pcie_has_flr(dev))
+		return PCI_ERS_RESULT_NONE;
+
+	if (pcie_flr(dev))
+		return PCI_ERS_RESULT_DISCONNECT;
+
+	return PCI_ERS_RESULT_RECOVERED;
+}
+
 pci_ers_result_t pcie_do_recovery(struct pci_dev *dev,
 			enum pci_channel_state state,
 			pci_ers_result_t (*reset_link)(struct pci_dev *pdev))
@@ -191,15 +202,17 @@ pci_ers_result_t pcie_do_recovery(struct pci_dev *dev,
 	if (state == pci_channel_io_frozen) {
 		pci_walk_dev_affected(dev, report_frozen_detected, &status);
 		if (pci_pcie_type(dev) == PCI_EXP_TYPE_RC_END) {
-			pci_warn(dev, "link reset not possible for RCiEP\n");
-			status = PCI_ERS_RESULT_NONE;
-			goto failed;
-		}
-
-		status = reset_link(dev);
-		if (status != PCI_ERS_RESULT_RECOVERED) {
-			pci_warn(dev, "link reset failed\n");
-			goto failed;
+			status = flr_on_rciep(dev);
+			if (status != PCI_ERS_RESULT_RECOVERED) {
+				pci_warn(dev, "function level reset failed\n");
+				goto failed;
+			}
+		} else {
+			status = reset_link(dev);
+			if (status != PCI_ERS_RESULT_RECOVERED) {
+				pci_warn(dev, "link reset failed\n");
+				goto failed;
+			}
 		}
 	} else {
 		pci_walk_dev_affected(dev, report_normal_detected, &status);