Re: [PATCH v2] PCI: pciehp: Report degraded links via link bandwidth notification

From: Lukas Wunner
Date: Sun Feb 24 2019 - 21:29:03 EST


On Fri, Dec 07, 2018 at 12:20:00PM -0600, Alexandru Gagniuc wrote:
> A warning is generated when a PCIe device is probed with a degraded
> link, but there was no similar mechanism to warn when the link becomes
> degraded after probing. The Link Bandwidth Notification provides this
> mechanism.
>
> Use the link bandwidth notification interrupt to detect bandwidth
> changes, and rescan the bandwidth, looking for the weakest point. This
> is the same logic used in probe().

This is unrelated to pciehp, so the subject prefix should be changed
to "PCI/portdrv".


> Q: Why is this unconditionally compiled in?
> A: The symmetrical check in pci probe() is also always compiled in.

Hm, it looks like the convention is to provide a separate Kconfig entry
for each port service.


> Q: Why call module_init() instead of adding a call in pcie_init_services() ?
> A: A call in pcie_init_services() also requires a prototype in portdrv.h, a
> non-static implementation in bw_notification.c. Using module_init() is
> functionally equivalent, and takes less code.

Commit c29de84149ab ("PCI: portdrv: Initialize service drivers directly")
moved away from module_init() on purpose, apparently to fix a race
condition.


> Q: Why print only on degraded links and not when a link is back to full speed?
> For symmetry with PCI probe(). Although I see a benefit in conveying that a
> link is back to full speed, I expect this to be extremely rare. Secondary bus
> reset is usually needed to retrain at full bandwidth.

What if the link is retrained at the same speed/width? Intuitively
I'd compare the speed in the Link Status Register to what is cached
in the cur_bus_speed member of struct pci_bus and print a message
only if the speed has changed. (Don't we need to cache the width as
well?)


> +static irqreturn_t pcie_bw_notification_irq(int irq, void *context)
> +{
> + struct pcie_device *srv = context;
> + struct pci_dev *port = srv->port;
> + struct pci_dev *dev;
> + u16 link_status, events;
> + int ret;
> +
> + ret = pcie_capability_read_word(port, PCI_EXP_LNKSTA, &link_status);
> + events = link_status & PCI_EXP_LNKSTA_LBMS;
> +
> + if (!events || ret != PCIBIOS_SUCCESSFUL)
> + return IRQ_NONE;
> +
> + /* Print status from upstream link partner, not this downstream port. */
> + list_for_each_entry(dev, &port->subordinate->devices, bus_list)
> + __pcie_print_link_status(dev, false);
> +
> + pcie_capability_write_word(port, PCI_EXP_LNKSTA, events);
> + return IRQ_HANDLED;
> +}

You need to hold pci_bus_sem for the list iteration to protect against
simultaneous removal of child devices.

This may sleep, so request the IRQ with request_threaded_irq(), pass NULL
for the "handler" argument and pcie_bw_notification_irq for the "thread_fn"
argument.

You may want to call pcie_update_link_speed() to update the now stale
speed cached in the pci_bus struct.


> + ret = devm_request_irq(&srv->device, srv->irq, pcie_bw_notification_irq,
> + IRQF_SHARED, "PCIe BW notif", srv);

The plan is to move away from port services as devices in the future,
so device-managed allocations should be avoided.

Apart from that, with a device-managed IRQ, if this service driver is
unbound, its IRQ handler may still be invoked if some other port service
signals an interrupt (because the IRQ is shared). Which seems wrong.


> + .port_type = PCI_EXP_TYPE_DOWNSTREAM,

According to PCIe r4.0, sec 7.8.6, Link Bandwidth Notification Capability
is also required for root ports, so I think you need to match for
PCIE_ANY_PORT and amend pcie_link_bandwidth_notification_supported()
to check that you're dealing with a root or downstream port.


> + .service = PCIE_PORT_SERVICE_BwNOTIF,

All caps please.


> @@ -136,9 +136,12 @@ static int pcie_port_enable_irq_vec(struct pci_dev *dev, int *irqs, int mask)
> }
>
> /* PME and hotplug share an MSI/MSI-X vector */
> - if (mask & (PCIE_PORT_SERVICE_PME | PCIE_PORT_SERVICE_HP)) {
> - irqs[PCIE_PORT_SERVICE_PME_SHIFT] = pci_irq_vector(dev, pme);
> - irqs[PCIE_PORT_SERVICE_HP_SHIFT] = pci_irq_vector(dev, pme);
> + if (mask & (PCIE_PORT_SERVICE_PME | PCIE_PORT_SERVICE_HP |
> + PCIE_PORT_SERVICE_BwNOTIF)) {
> + pcie_irq = pci_irq_vector(dev, pme);
> + irqs[PCIE_PORT_SERVICE_PME_SHIFT] = pcie_irq;
> + irqs[PCIE_PORT_SERVICE_HP_SHIFT] = pcie_irq;
> + irqs[PCIE_PORT_SERVICE_BWNOTIF_SHIFT] = pcie_irq;

Please amend the code comment with something like

- /* PME and hotplug share an MSI/MSI-X vector */
+ /* PME, hotplug and bandwidth notification share an MSI/MSI-X vector */


> @@ -250,6 +253,9 @@ static int get_port_device_capability(struct pci_dev *dev)
> pci_aer_available() && services & PCIE_PORT_SERVICE_AER)
> services |= PCIE_PORT_SERVICE_DPC;
>
> + if (pci_pcie_type(dev) == PCI_EXP_TYPE_DOWNSTREAM)
> + services |= PCIE_PORT_SERVICE_BwNOTIF;
> +

Also need to check for root ports, see above.


Otherwise LGTM.

Thanks,

Lukas