Re: [PATCH] thunderbolt: Resume PCIe bridges after switch is found on AMD USB4 controller

From: Limonciello, Mario
Date: Thu Sep 08 2022 - 11:22:33 EST


On 9/8/2022 09:02, Kai-Heng Feng wrote:
"

On Thu, Sep 8, 2022 at 12:30 AM Limonciello, Mario
<Mario.Limonciello@xxxxxxx> wrote:

[Public]

Hi,

-----Original Message-----
From: Greg KH <gregkh@xxxxxxxxxxxxxxxxxxx>
Sent: Monday, September 5, 2022 02:30
To: Kai-Heng Feng <kai.heng.feng@xxxxxxxxxxxxx>
Cc: mika.westerberg@xxxxxxxxxxxxxxx; andreas.noever@xxxxxxxxx;
michael.jamet@xxxxxxxxx; YehezkelShB@xxxxxxxxx; Mehta, Sanju
<Sanju.Mehta@xxxxxxx>; Limonciello, Mario
<Mario.Limonciello@xxxxxxx>; linux-usb@xxxxxxxxxxxxxxx; linux-
kernel@xxxxxxxxxxxxxxx
Subject: Re: [PATCH] thunderbolt: Resume PCIe bridges after switch is found
on AMD USB4 controller

On Mon, Sep 05, 2022 at 02:56:22PM +0800, Kai-Heng Feng wrote:
AMD USB4 can not detect external PCIe devices like external NVMe when
it's hotplugged, because card/link are not up:

pcieport 0000:00:04.1: pciehp: pciehp_check_link_active: lnk_status = 1101

That sounds like a hardware bug, how does this work in other operating
systems for this hardware?

We happen to have this HP system in our lab. My colleague Anson (now on CC) flashed
the same BIOS to it (01.02.01) using dediprog and loaded a 6.0-rc3 mainline kernel built
from the Canonical mainline kernel PPA.

He then tried to hotplug a TBT3 SSD a number of times but couldn't hit this issue.
I attached his log to the kernel Bugzilla.

Nice to hear. Hopefully this can be fixed at firmware/hardware side.

I guess you and Anson might want to sync up offline and compare whether you have the same hardware stepping.




Use `lspci` to resume pciehp bridges can find external devices.

That's not good :(

A long delay before checking card/link presence doesn't help, either.
The only way to make the hotplug work is to enable pciehp interrupt and
check card presence after the TB switch is added.

Since the topology of USB4 and its PCIe bridges are siblings, hardcode
the bridge ID so TBT driver can wake them up to check presence.

As I mention below, this is not an acceptable solution.

AMD developers, any ideas on how to get this fixed in the TB controller
firware instead?

Anson also double checked on the AMD reference hardware that the HP system is built
against and couldn't reproduce it there either.

KH, I've got a few questions/comments to try to better explain why we're here.

1) How did you flash the 01.02.01 firmware? In Anson's check, he used dediprog.
Is it possible there was some stateful stuff used by HP's BIOS still on the SPI from the
upgrade that didn't get set/cleared properly from an earlier pre-release BIOS?

We used UEFI capsule to update the firmware, via fwupd.

So that's a difference from how Anson did it. Could you perhaps dump the BIOS SPI image? Anson can flash the exact same dump via dediprog and see if he can repro.

It would let us confirm if it was caused by your upgrade path.



2) Did you change any BIOS settings? Particularly anything to do with Pre-OS CM?

No, nothing in BIOS was changed. >

3) If you explicitly reset to HP's "default BIOS settings" does it resolve?

Doesn't help. I put the device to ACPI G3 and it doesn't help, either.

OK.



4) Can you double check ADP_CS_5 bit 31? I attached is a patch to kernel Bugzilla to
add dyndbg output for it. If it was for some reason set by Pre-OS CM in your BIOS/settings
combination, we might need to undo it by the Linux CM.

All ports say "Hotplug disabled: 0".

dmesg attached to the bugzilla.

OK, that at least rules out DHP from Pre-OS CM.



5) Are you changing any of the default runtime PM policies for any of the USB4 routers or
root ports used for tunneling using software like TLP?

No. And they should be suspended by default.


Thinking about this being possibly a firmware upgrade path problem, can you please check:

# grep SMC /sys/kernel/debug/dri/0/amdgpu_firmware_info

Anson's system was 0x04453200 (program 4, version 69.50.0).

Kai-Heng




Bugzilla:
https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fbugz
illa.kernel.org%2Fshow_bug.cgi%3Fid%3D216448&amp;data=05%7C01%7Cm
ario.limonciello%40amd.com%7C1e27b1d6f69e42796c7b08da8f107121%7C3d
d8961fe4884e608e11a82d994e183d%7C0%7C0%7C637979598042186185%7CU
nknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI
6Ik1haWwiLCJXVCI6Mn0%3D%7C3000%7C%7C%7C&amp;sdata=0lhcaKfUyoK
0FXT9uDZ8a%2Fpxs9tHd8aoQcyPFdB%2F0eY%3D&amp;reserved=0
Signed-off-by: Kai-Heng Feng <kai.heng.feng@xxxxxxxxxxxxx>
---
drivers/thunderbolt/nhi.c | 29 +++++++++++++++++++++++++++++
drivers/thunderbolt/switch.c | 6 ++++++
drivers/thunderbolt/tb.c | 1 +
drivers/thunderbolt/tb.h | 5 +++++
include/linux/thunderbolt.h | 1 +
5 files changed, 42 insertions(+)

diff --git a/drivers/thunderbolt/nhi.c b/drivers/thunderbolt/nhi.c
index cb8c9c4ae93a2..75f5ce5e22978 100644
--- a/drivers/thunderbolt/nhi.c
+++ b/drivers/thunderbolt/nhi.c
@@ -1225,6 +1225,8 @@ static int nhi_probe(struct pci_dev *pdev, const
struct pci_device_id *id)
{
struct tb_nhi *nhi;
struct tb *tb;
+ struct pci_dev *p = NULL;
+ struct tb_pci_bridge *pci_bridge, *n;
int res;

if (!nhi_imr_valid(pdev)) {
@@ -1306,6 +1308,19 @@ static int nhi_probe(struct pci_dev *pdev, const
struct pci_device_id *id)
nhi_shutdown(nhi);
return res;
}
+
+ if (pdev->vendor == PCI_VENDOR_ID_AMD) {
+ while ((p = pci_get_device(PCI_VENDOR_ID_AMD, 0x14cd,
p))) {
+ pci_bridge = kmalloc(sizeof(struct tb_pci_bridge),
GFP_KERNEL);
+ if (!pci_bridge)
+ goto cleanup;
+
+ pci_bridge->bridge = p;
+ INIT_LIST_HEAD(&pci_bridge->list);
+ list_add(&pci_bridge->list, &tb->bridge_list);
+ }
+ }

You can't walk the device tree and create a "shadow" list of devices
like this and expect any lifetime rules to work properly with them at
all.

Please do not do this.

greg k-h