Re: [PATCH] nvme-pci: Prevent mmio reads if pci channel offline

From: Christoph Hellwig
Date: Thu Feb 28 2019 - 09:17:11 EST


On Wed, Feb 27, 2019 at 08:04:35PM +0000, Austin.Bolen@xxxxxxxx wrote:
> Confirmed this issue does not apply to the referenced Dell servers so I
> do not have a stake in how this should be handled for those systems.
> It may be they just don't support surprise removal. I know in our case
> all the Linux distributions we qualify (RHEL, SLES, Ubuntu Server) have
> told us they do not support surprise removal. So I'm guessing that any
> issues found with surprise removal could potentially fall under the
> category of "unsupported".
>
> Still though, the larger issue of recovering from other types of PCIe
> errors that are not due to device removal is still important. I would
> expect many systems from many platform makers to not be able to recover
> PCIe errors in general and hopefully the new DPC CER model will help
> address this and provide added protection for cases like above as well.

FYI, a related issue I saw about a year or two ago on Dell servers
involved a dual-ported NVMe add-in (non-U.2) card: once you did a
subsystem reset, which causes both controllers to retrain the link,
you'd run into a Firmware First error handling issue that would
instantly crash the system. I don't have the hardware anymore, but
I think the end result was that the affected product ended up shipping
with subsystem resets enabled only for the U.2 form factor.