Re: [PATCH v15 4/5] PCI/DPC: Add Error Disconnect Recover (EDR) support

From: Kuppuswamy Sathyanarayanan
Date: Wed Feb 26 2020 - 17:14:40 EST



On 2/26/20 1:32 PM, Bjorn Helgaas wrote:
On Wed, Feb 26, 2020 at 10:42:27AM -0800, Kuppuswamy Sathyanarayanan wrote:
On 2/25/20 5:02 PM, Bjorn Helgaas wrote:
On Thu, Feb 13, 2020 at 10:20:16AM -0800, sathyanarayanan.kuppuswamy@xxxxxxxxxxxxxxx wrote:
From: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@xxxxxxxxxxxxxxx>
...

Yes, we could remove it, but that would need some more changes to the
DPC driver functions. I can think of two ways:

1. Refactor the DPC driver to not use the dpc_dev structure and just use
pci_dev in its functions. But this might mean re-reading the following
dpc_dev structure members every time we use them in the DPC driver
functions.

(Currently the DPC driver caches the following device parameters in its
probe function.)

	u16			cap_pos;
	bool			rp_extensions;
	u8			rp_log_size;
	u16			ctl;
	u16			cap;
I think this is basically what I proposed with the sample patch in my
response to your 3/5 patch. But I don't see the ctl/cap part, so
maybe I missed something.
If it's costly to carry them in pci_dev, we can always re-read them.
If it's OK to use pci_dev, I can extend your patch to include cap and
ctl, if you want.
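
To make the trade-off concrete, here is a rough userspace sketch of the
two options (cache once vs. re-read on each use). It is not kernel code:
mock_read_config_word() stands in for pci_read_config_word(), struct
mock_pci_dev and its dpc_* fields are hypothetical, and the DPC_* offsets
and masks are my assumption of the DPC capability layout.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Mock config space; stands in for real PCI config accesses. */
#define CFG_SIZE 4096
uint8_t cfg[CFG_SIZE];
unsigned int cfg_reads;		/* counts mock config reads */

uint16_t mock_read_config_word(unsigned int off)
{
	assert(off + 1 < CFG_SIZE);
	cfg_reads++;
	return (uint16_t)(cfg[off] | (cfg[off + 1] << 8));
}

/* Assumed DPC capability layout (offsets relative to the capability). */
#define DPC_CAP			0x04	/* DPC Capability register */
#define DPC_CTL			0x06	/* DPC Control register */
#define DPC_CAP_RP_EXT		0x0020	/* RP extensions supported */
#define DPC_CAP_RP_LOG_SIZE	0x0f00	/* RP PIO log size, bits 11:8 */

/* Hypothetical subset of pci_dev carrying the cached DPC state. */
struct mock_pci_dev {
	uint16_t dpc_cap_pos;
	uint16_t dpc_cap;
	uint16_t dpc_ctl;
	bool     dpc_rp_extensions;
	uint8_t  dpc_rp_log_size;
};

/* Option A: read the registers once and cache the decoded values. */
void mock_dpc_cache(struct mock_pci_dev *pdev, uint16_t cap_pos)
{
	pdev->dpc_cap_pos = cap_pos;
	pdev->dpc_cap = mock_read_config_word(cap_pos + DPC_CAP);
	pdev->dpc_ctl = mock_read_config_word(cap_pos + DPC_CTL);
	pdev->dpc_rp_extensions = !!(pdev->dpc_cap & DPC_CAP_RP_EXT);
	pdev->dpc_rp_log_size = (pdev->dpc_cap & DPC_CAP_RP_LOG_SIZE) >> 8;
}

/* Option B: no cached state; every caller pays for a config read. */
uint8_t mock_dpc_rp_log_size(uint16_t cap_pos)
{
	uint16_t cap = mock_read_config_word(cap_pos + DPC_CAP);

	return (cap & DPC_CAP_RP_LOG_SIZE) >> 8;
}
```

With option A the cost is a few extra bytes per device; with option B it
is one config read per use, which only matters if the error path calls
these helpers often.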
This message should be expanded somehow. I think the point is that we
got an EDR notification, but firmware couldn't tell us where the
containment event occurred. Should that ever happen? Or is it a
firmware defect if it *does* happen?
Yes, if we hit this error then it's a firmware defect. Either
firmware sent a wrong BDF value or used an invalid return type.

I was planning to add some extra error info in acpi_locate_dpc_port():

+	if (obj->type != ACPI_TYPE_INTEGER) {
+		ACPI_FREE(obj);
+		return NULL;
+	}


In any event, I think the message should say something like "Can't
identify source of EDR notification".
I will use your suggestion here along with the above-mentioned change.
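
As an illustration only, the two firmware failure modes mentioned above
(invalid return type vs. wrong BDF) could be told apart roughly like
this. This is a userspace mock, not the actual patch: the mock_* types,
the locate_status codes and mock_device_exists() are all hypothetical,
and the BDF layout (15:8 bus, 7:3 device, 2:0 function) is assumed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for the ACPI object returned by firmware. */
enum mock_obj_type { MOCK_TYPE_INTEGER, MOCK_TYPE_BUFFER };

struct mock_acpi_obj {
	enum mock_obj_type type;
	uint64_t value;
};

enum locate_status {
	LOCATE_OK,
	LOCATE_NO_OBJECT,	/* evaluation returned nothing */
	LOCATE_BAD_TYPE,	/* firmware used an invalid return type */
	LOCATE_BAD_BDF,		/* BDF does not name a known device */
};

/* Pretend only 00:1c.0 exists on this mock system. */
int mock_device_exists(uint8_t bus, uint8_t dev, uint8_t fn)
{
	return bus == 0x00 && dev == 0x1c && fn == 0;
}

/* Report which check failed so the caller can log a specific message. */
enum locate_status mock_locate_dpc_port(const struct mock_acpi_obj *obj,
					uint16_t *bdf)
{
	uint16_t port;

	assert(bdf != NULL);

	if (!obj)
		return LOCATE_NO_OBJECT;
	if (obj->type != MOCK_TYPE_INTEGER)
		return LOCATE_BAD_TYPE;

	/* Assumed layout: 15:8 = bus, 7:3 = device, 2:0 = function. */
	port = (uint16_t)obj->value;
	if (!mock_device_exists(port >> 8, (port >> 3) & 0x1f, port & 0x7))
		return LOCATE_BAD_BDF;

	*bdf = port;
	return LOCATE_OK;
}
```

The caller could still print the generic "Can't identify source of EDR
notification" message and append the specific reason from the status
code.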

This seems... I'm not sure what. I guess it's really just reading
the DPC capability for use by dpc_process_error(), so maybe it's OK.
But it's a little strange to read.
I *think* maybe if we move the DPC info into the struct pci_dev it
will solve this issue too? I.e., we won't have a struct dpc_dev, so
we won't have this funny-looking dpc_dev_init().
Yes, your patch will resolve this issue.

No, this is a valid case. It will only happen if we have a non-ACPI
based switch attached to the root port.
I agree this is a valid case (as I mentioned below). My point was
just that if it is a valid case, we might not want to use pci_warn()
here. Maybe pci_info() if you think it's important, or maybe no
message at all. I don't think "Initializing dpc again" is going to be
useful to a user.
Got it. Adding pci_info() here will be helpful to understand the flow.
Since EDR is a rare case, it should not pollute dmesg, so I will add it.

Yes, ownership should be based on _OSC negotiation. I will add the
necessary comments here.
Why are we not doing this via _OSC negotiation in this series? It
would be much better if we could just do it instead of adding a
comment that we *should* do it. Nobody knows more about this than you
do, so probably nobody else is going to come along and finish this
up :)
Actually Alex G already proposed a patch to fix it.

https://lkml.org/lkml/2018/11/16/202

But that discussion never reached a conclusion. Since a proper fix
for it would affect some legacy hardware which relies solely on
HEST tables, it did not make everyone happy. So it might take a
lot to convince all the stakeholders to merge such a patch. So it's
better not to mix these two patch sets together.

Once this patch set is done, if Alex G is no longer working on it,
I can work on it.

--
Sathyanarayanan Kuppuswamy
Linux kernel developer