Re: [PATCH] PCI/AER: Do not clear AER bits if we don't own AER

From: Alex G.
Date: Thu Aug 09 2018 - 15:42:34 EST




On 08/09/2018 02:18 PM, Bjorn Helgaas wrote:
On Thu, Aug 09, 2018 at 02:00:23PM -0500, Alex G. wrote:
On 08/09/2018 01:29 PM, Bjorn Helgaas wrote:
On Thu, Aug 09, 2018 at 04:46:32PM +0000, Alex_Gagniuc@xxxxxxxxxxxx wrote:
On 08/09/2018 09:16 AM, Bjorn Helgaas wrote:
(snip_
enable_ecrc_checking()
disable_ecrc_checking()

I don't immediately see how this would affect FFS, but the bits are part
of the AER capability structure. According to the FFS model, those would
be owned by FW, and we'd have to avoid touching them.

Per ACPI v6.2, sec 18.3.2.4, the HEST may contain entries for Root
Ports that contain the FIRMWARE_FIRST flag as well as values the OS is
supposed to write to several AER capability registers. It looks like
we currently ignore everything except the FIRMWARE_FIRST and GLOBAL
flags (ACPI_HEST_FIRMWARE_FIRST and ACPI_HEST_GLOBAL in Linux).

That seems like a pretty major screwup and more than I want to fix
right now.

The logic is not very clear, but I think it goes like this:
For GLOBAL and FFS, disable native AER everywhere.
When !GLOBAL and FFS, then only disable native AER for the root port
described by the HEST entry.

I agree the code is convoluted, but that sounds right to me.

What I meant is that we ignore the values the HEST entry tells us
we're supposed to write to Device Control and the AER Uncorrectable
Error Mask, Uncorrectable Error Severity, Correctable Error Mask, and
AER Capabilities and Control.

Wait, what? _HPX has the same information. This is madness!
Since root ports are not hot-swappable, the BIOS normally programs those registers. Even if linux doesn't apply said masks, the programming BIOS did should be sufficient to have *cough* correct *cough* behavior.

For practical considerations this is not an issue today. The ACPI error
handling code currently crashes when it encounters any fatal error, so
we wouldn't hit this in the FFS case.

I wasn't aware the firmware-first path was *that* broken. Are there
problem reports for this? Is this a regression?

It's been like this since, I believe, 3.10, and probably much earlier. All
reports that I have seen of linux crashing on surprise hot-plug have been
caused by the panic() call in the apei code. Dell BIOSes do an extreme
amount of work to determine when it's safe to _not_ report errors to the OS,
since all known OSes crash on this path.

Oh, is this the __ghes_panic() path? If so, I'm going to turn away
and plead ignorance unless the PCI core is doing something wrong that
eventually results in that panic.

I agree, and I'll quote you on that!

Alex