RE: [PATCH v10 07/17] CXL/PCI: Introduce CXL uncorrectable protocol error recovery

From: Shiju Jose
Date: Fri Jun 27 2025 - 08:28:09 EST


>-----Original Message-----
>From: Terry Bowman <terry.bowman@xxxxxxx>
>Sent: 26 June 2025 23:43
>To: dave@xxxxxxxxxxxx; Jonathan Cameron <jonathan.cameron@xxxxxxxxxx>;
>dave.jiang@xxxxxxxxx; alison.schofield@xxxxxxxxx; dan.j.williams@xxxxxxxxx;
>bhelgaas@xxxxxxxxxx; Shiju Jose <shiju.jose@xxxxxxxxxx>;
>ming.li@xxxxxxxxxxxx; Smita.KoralahalliChannabasappa@xxxxxxx;
>rrichter@xxxxxxx; dan.carpenter@xxxxxxxxxx;
>PradeepVineshReddy.Kodamati@xxxxxxx; lukas@xxxxxxxxx;
>Benjamin.Cheatham@xxxxxxx;
>sathyanarayanan.kuppuswamy@xxxxxxxxxxxxxxx; terry.bowman@xxxxxxx;
>linux-cxl@xxxxxxxxxxxxxxx
>Cc: linux-kernel@xxxxxxxxxxxxxxx; linux-pci@xxxxxxxxxxxxxxx
>Subject: [PATCH v10 07/17] CXL/PCI: Introduce CXL uncorrectable protocol error
>recovery
>
>Create cxl_do_recovery() to provide uncorrectable protocol error (UCE)
>handling. It follows the design of pcie_do_recovery() in the PCIe error driver.
>One difference is that cxl_do_recovery() treats all UCEs as fatal and panics
>the kernel, to prevent corruption of CXL memory.
>
>Export the PCI error driver's merge_result() to CXL namespace. Introduce
>PCI_ERS_RESULT_PANIC and add support in merge_result() routine. This will be
>used by CXL to panic the system in the case of uncorrectable protocol errors. PCI
>error handling is not currently expected to use the PCI_ERS_RESULT_PANIC.
>
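For reference, a minimal user-space sketch of how I read the merge_result() change described above. The enum values mirror enum pci_ers_result in include/linux/pci.h, but the model_* names are hypothetical, and the placement of the PANIC check (and making it sticky once set) is my assumption, not the code in this patch.

```c
/*
 * User-space model only, not kernel code. All model_* / MODEL_* names
 * are hypothetical stand-ins for the pci_ers_result values.
 */
enum model_ers_result {
	MODEL_ERS_NONE = 1,
	MODEL_ERS_CAN_RECOVER = 2,
	MODEL_ERS_NEED_RESET = 3,
	MODEL_ERS_DISCONNECT = 4,
	MODEL_ERS_RECOVERED = 5,
	MODEL_ERS_NO_AER_DRIVER = 6,
	MODEL_ERS_PANIC = 7,	/* value proposed by this patch */
};

/* Fold one device's vote ("next") into the running result ("orig"). */
enum model_ers_result model_merge_result(enum model_ers_result orig,
					 enum model_ers_result next)
{
	/* Assumption: a PANIC vote from any device wins and stays set. */
	if (orig == MODEL_ERS_PANIC || next == MODEL_ERS_PANIC)
		return MODEL_ERS_PANIC;

	if (next == MODEL_ERS_NO_AER_DRIVER)
		return MODEL_ERS_NO_AER_DRIVER;

	/* A device with no opinion leaves the running result unchanged. */
	if (next == MODEL_ERS_NONE)
		return orig;

	switch (orig) {
	case MODEL_ERS_CAN_RECOVER:
	case MODEL_ERS_RECOVERED:
		orig = next;
		break;
	case MODEL_ERS_DISCONNECT:
		if (next == MODEL_ERS_NEED_RESET)
			orig = MODEL_ERS_NEED_RESET;
		break;
	default:
		break;
	}

	return orig;
}
```

With this model, a single PANIC vote while walking the devices below the bridge is enough for the caller to see PANIC at the end of the walk, whatever the other devices report.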
>Copy pci_walk_bridge() to cxl_walk_bridge(). Make a change to walk the first
>device in all cases.
>
>Copy the PCI error driver's report_error_detected() to
>cxl_report_error_detected(). Note, only CXL Endpoints and RCH Downstream Ports
>(RCH DSP) are currently supported. Add locking for the PCI device, as done in
>PCI's report_error_detected().
>This is necessary to prevent the RAS registers from disappearing before logging
>is completed.
>
>Call panic() to halt the system in the case of uncorrectable errors (UCE) in
>cxl_do_recovery(). Export pci_aer_clear_fatal_status() for CXL to use if a UCE is
>not found; in that case the AER status must still be cleared, which is done with
>pci_aer_clear_fatal_status().
>
>Signed-off-by: Terry Bowman <terry.bowman@xxxxxxx>
>---
> drivers/cxl/core/native_ras.c | 44 +++++++++++++++++++++++++++++++++++
> drivers/pci/pcie/cxl_aer.c | 3 ++-
> drivers/pci/pcie/err.c | 8 +++++--
> include/linux/aer.h | 11 +++++++++
> include/linux/pci.h | 3 +++
> 5 files changed, 66 insertions(+), 3 deletions(-)
>
[...]
>
> void pci_print_aer(struct pci_dev *dev, int aer_severity,
>diff --git a/include/linux/pci.h b/include/linux/pci.h
>index 79326358f641..16a8310e0373 100644
>--- a/include/linux/pci.h
>+++ b/include/linux/pci.h
>@@ -868,6 +868,9 @@ enum pci_ers_result {
>
> /* No AER capabilities registered for the driver */
> PCI_ERS_RESULT_NO_AER_DRIVER = (__force pci_ers_result_t) 6,
>+
>+ /* System is unstable, panic. Is CXL specific */
>+ PCI_ERS_RESULT_PANIC = (__force pci_ers_result_t) 7,
Nit: is the extra space after the cast intentional?
> };
>
> /* PCI bus error event callbacks */
>--
>2.34.1