Re: [PATCH v10 04/17] CXL/AER: Introduce CXL specific AER driver file

From: dan.j.williams
Date: Thu Jul 24 2025 - 16:24:27 EST


Bowman, Terry wrote:
> On 7/23/2025 8:16 PM, dan.j.williams@xxxxxxxxx wrote:
> > Terry Bowman wrote:
> >> The CXL AER error handling logic currently resides in the AER driver file,
> >> drivers/pci/pcie/aer.c. CXL specific changes are conditionally compiled
> >> using #ifdefs.
> >>
> >> Improve the AER driver maintainability by separating the CXL specific logic
> >> from the AER driver's core functionality and removing the #ifdefs.
> >> Introduce drivers/pci/pcie/cxl_aer.c and move the CXL AER logic into the
> >> new file.
> >>
> >> Update the makefile to conditionally compile the CXL file using the
> >> existing CONFIG_PCIEAER_CXL Kconfig.
> >>
> >> Signed-off-by: Terry Bowman <terry.bowman@xxxxxxx>
> >> ---
> > After reading patch5 I want to qualify my Reviewed-by:...
> >
> >> drivers/pci/pci.h | 8 +++
> >> drivers/pci/pcie/Makefile | 1 +
> >> drivers/pci/pcie/aer.c | 138 -------------------------------------
> >> drivers/pci/pcie/cxl_aer.c | 138 +++++++++++++++++++++++++++++++++++++
> > This is a poor name for this file because the functionality only relates to
> > code that supports a dead-end generation of RCH / RCD hardware platforms.
> >
> > I do agree that it should be removed from aer.c so typical PCIe AER
> > maintenance does not need to trip over that cruft.
> >
> > Please call it something like rch_aer.c so it is tucked out of the way,
> > sticks out as odd in any future diffstat, and does not confuse from the
> > CXL VH error handling that supports current and future generation
> > hardware.
> >
> > Perhaps even move it to its own silent Kconfig symbol with a deprecation
> > warning, something like below, so someone remembers to delete it.
>
> cxl_rch_handle_error_iter() and cxl_rch_handle_error() need to be moved from pci/pcie/cxl_aer.c
> into cxl/core/native_ras.c introduced in this series. There is no RCH or VH handling in cxl_aer.c. 
> cxl_aer.c serves to detect if an error is a CXL error and if it is then it forwards it to the 
> CXL drivers using the kfifo introduced later. I will update the commit message stating more 
> will be added later.

Wait, this set moves the same function to a new file twice in the same
set? I had not gotten that far along, but that's not acceptable.

The reasons I had assumed that the rch bits would remain as a vestigial
drivers/pci/pcie/rch_aer.c file to be cut from the kernel later are:

- The goal of forwarding protocol errors to the cxl_core is that the
cxl_core maintains a cxl_port hierarchy. For the RCH case there is no
hierarchy and little to no value in being able disposition or decorate
error reports with the cxl_port driver.

- The RCH code requires a series of new PCI core exports for this
one-off unfortunate mistake of history where the CXL specification
tried way too hard to hide the presence of CXL. If this code is
already on a deprecation path, that contraindicates new exports.

> Dave Jiang introduced cxl/core/pci_aer.c I understand the name is still up for possible change.
> The native_ras.c changes in this series is planned to be moved into cxl/core/pci_aer.c for v11. 
> The files were created with the same purpose but we used different filenames and need to converge.

Why not put this stuff in the existing cxl/core/ras.c? I do expect that
we want to route CPER reports to cxl_port objects at some point, so the
"native" distinction is more confusing than beneficial as far as I can
see.