Re: [PATCH] Reset PCIe devices to stop ongoing DMA

From: Takao Indoh
Date: Wed May 08 2013 - 04:34:58 EST


(2013/05/08 7:04), Alex Williamson wrote:
> On Tue, 2013-05-07 at 16:10 -0400, Don Dutile wrote:
>> On 05/07/2013 12:39 PM, Alex Williamson wrote:
>>> On Wed, 2013-04-24 at 13:58 +0900, Takao Indoh wrote:
>>>> This patch resets PCIe devices on boot to stop ongoing DMA. When
>>>> "pci=pcie_reset_devices" is specified, a hot reset is triggered on each
>>>> PCIe root port and downstream port to reset its downstream endpoint.
>>>>
>>>> Problem:
>>>> This patch solves the problem that kdump can fail when intel_iommu=on is
>>>> specified. When intel_iommu=on is specified, many dma-remapping errors
>>>> occur in second kernel and it causes problems like driver error or PCI
>>>> SERR, at last kdump fails. This problem is caused as follows.
>>>> 1) Devices are working on first kernel.
>>>> 2) Switch to second kernel(kdump kernel). The devices are still working
>>>> and its DMA continues during this switch.
>>>> 3) iommu is initialized during second kernel boot and ongoing DMA causes
>>>> dma-remapping errors.
>>>>
>>>> Solution:
>>>> All DMA transactions have to be stopped before iommu is initialized. By
>>>> this patch devices are reset and in-flight DMA is stopped before
>>>> pci_iommu_init.
>>>>
>>>> To invoke hot reset on an endpoint, its upstream link need to be reset.
>>>> reset_pcie_devices() is called from fs_initcall_sync, and it finds root
>>>> port/downstream port whose child is PCIe endpoint, and then reset link
>>>> between them. If the endpoint is VGA device, it is skipped because the
>>>> monitor blacks out if VGA controller is reset.
>>>>
>>>> Actually this is v8 patch but quite different from v7 and it's been so
>>>> long since previous post, so I start over again.
>>>> Previous post:
>>>> [PATCH v7 0/5] Reset PCIe devices to address DMA problem on kdump
>>>> https://lkml.org/lkml/2012/11/26/814
>>>>
>>>> Signed-off-by: Takao Indoh<indou.takao@xxxxxxxxxxxxxx>
>>>> ---
>>>> Documentation/kernel-parameters.txt | 2 +
>>>> drivers/pci/pci.c | 103 +++++++++++++++++++++++++++++++++++
>>>> 2 files changed, 105 insertions(+), 0 deletions(-)
>>>>
>>>> diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
>>>> index 4609e81..2a31ade 100644
>>>> --- a/Documentation/kernel-parameters.txt
>>>> +++ b/Documentation/kernel-parameters.txt
>>>> @@ -2250,6 +2250,8 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
>>>> any pair of devices, possibly at the cost of
>>>> reduced performance. This also guarantees
>>>> that hot-added devices will work.
>>>> + pcie_reset_devices Reset PCIe endpoint on boot by hot
>>>> + reset
>>>> cbiosize=nn[KMG] The fixed amount of bus space which is
>>>> reserved for the CardBus bridge's IO window.
>>>> The default value is 256 bytes.
>>>> diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
>>>> index b099e00..42385c9 100644
>>>> --- a/drivers/pci/pci.c
>>>> +++ b/drivers/pci/pci.c
>>>> @@ -3878,6 +3878,107 @@ void __weak pci_fixup_cardbus(struct pci_bus *bus)
>>>> }
>>>> EXPORT_SYMBOL(pci_fixup_cardbus);
>>>>
>>>> +/*
>>>> + * Return true if dev is PCIe root port or downstream port whose child is PCIe
>>>> + * endpoint except VGA device.
>>>> + */
>>>> +static int __init need_reset(struct pci_dev *dev)
>>>> +{
>>>> + struct pci_bus *subordinate;
>>>> + struct pci_dev *child;
>>>> +
>>>> + if (!pci_is_pcie(dev) || !dev->subordinate ||
>>>> + list_empty(&dev->subordinate->devices) ||
>>>> + ((pci_pcie_type(dev) != PCI_EXP_TYPE_ROOT_PORT)&&
>>>> + (pci_pcie_type(dev) != PCI_EXP_TYPE_DOWNSTREAM)))
>>>> + return 0;
>>>> +
>>>> + subordinate = dev->subordinate;
>>>> + list_for_each_entry(child,&subordinate->devices, bus_list) {
>>>> + if ((pci_pcie_type(child) == PCI_EXP_TYPE_UPSTREAM) ||
>>>> + (pci_pcie_type(child) == PCI_EXP_TYPE_PCI_BRIDGE) ||
>>>> + ((child->class>> 16) == PCI_BASE_CLASS_DISPLAY))
>>>> + /* Don't reset switch, bridge, VGA device */
>>>> + return 0;
>>>> + }
>>>> +
>>>> + return 1;
>>>> +}
>>>> +
>>>> +static void __init save_config(struct pci_dev *dev)
>>>> +{
>>>> + struct pci_bus *subordinate;
>>>> + struct pci_dev *child;
>>>> +
>>>> + if (!need_reset(dev))
>>>> + return;
>>>> +
>>>> + subordinate = dev->subordinate;
>>>> + list_for_each_entry(child,&subordinate->devices, bus_list) {
>>>> + dev_info(&child->dev, "save state\n");
>>>> + pci_save_state(child);
>>>> + }
>>>> +}
>>>> +
>>>> +static void __init restore_config(struct pci_dev *dev)
>>>> +{
>>>> + struct pci_bus *subordinate;
>>>> + struct pci_dev *child;
>>>> +
>>>> + if (!need_reset(dev))
>>>> + return;
>>>> +
>>>> + subordinate = dev->subordinate;
>>>> + list_for_each_entry(child,&subordinate->devices, bus_list) {
>>>> + dev_info(&child->dev, "restore state\n");
>>>> + pci_restore_state(child);
>>>> + }
>>>> +}
>>>> +
>>>> +static void __init do_device_reset(struct pci_dev *dev)
>>>> +{
>>>> + u16 ctrl;
>>>> +
>>>> + if (!need_reset(dev))
>>>> + return;
>>>> +
>>>> + dev_info(&dev->dev, "Reset Secondary bus\n");
>>>> +
>>>> + /* Assert Secondary Bus Reset */
>>>> + pci_read_config_word(dev, PCI_BRIDGE_CONTROL,&ctrl);
>>>> + ctrl |= PCI_BRIDGE_CTL_BUS_RESET;
>>>> + pci_write_config_word(dev, PCI_BRIDGE_CONTROL, ctrl);
>>>> +
>>>> + msleep(2);
>>>> +
>>>> + /* De-assert Secondary Bus Reset */
>>>> + ctrl&= ~PCI_BRIDGE_CTL_BUS_RESET;
>>>> + pci_write_config_word(dev, PCI_BRIDGE_CONTROL, ctrl);
>>>> +}
>>>> +
>>>> +static int __initdata pcie_reset_devices;
>>>> +static int __init reset_pcie_devices(void)
>>>> +{
>>>> + struct pci_dev *dev = NULL;
>>>> +
>>>> + if (!pcie_reset_devices)
>>>> + return 0;
>>>> +
>>>> + for_each_pci_dev(dev)
>>>> + save_config(dev);
>>>> +
>>>> + for_each_pci_dev(dev)
>>>> + do_device_reset(dev);
>>>
>>> So we do a reset on each root port or downstream port, but the PCI to
>>> PCI bridge spec (1.2) states:
>>>
>>> 11.1.1. Secondary Reset Signal
>>> The secondary reset signal, RST#, is a logical OR of the
>>> primary interface RST# signal and
>>> the state of the Secondary Bus Reset bit of the Bridge
>>> Control register (see Section 3.2.5.18).
>>>
>>> I read that to mean that in a legacy topology, doing a reset on the top
>>> level bridge triggers a reset down the entire hierarchy. How does this
>>> apply to a PCIe topology? Can we just do a reset on the top level root
>>> ports? If so, that would also imply that a reset propagates down
>> On PCIe, a reset does not propagate down the tree (from upstream link to
>> downstream link of a bridge/switch).
>
> I don't think this is actually true. Section 6.6.1 of the spec defines
> 3 types of conventional resets, cold, warm and hot. I believe the one
> we're interested in is hot, which is an in-band mechanism for
> propagating a conventional reset across a link. Section 4.2.5.11
> defines hot reset:
>
> Hot Reset uses bit 0 (Hot Reset) in the Training Control field
> (see Table 4-5 and Table 4-6) within the TS1 and TS2 Ordered
> Sets.
>
> A Link can enter Hot Reset if directed by a higher Layer. A Link
> can also reach the Hot Reset state by receiving two consecutive
> TS1 Ordered Sets with the Hot Reset bit asserted (see Section
> 4.2.6.11).
>
> Section 4.2.6.11 then goes on as:
>
> * Lanes that were not directed by a higher Layer to initiate Hot
> Reset (i.e., received two consecutive TS1 Ordered Sets with the
> Hot Reset bit asserted on any configured Lanes):
> * If any Lane of an Upstream Port of a Switch receives two
> consecutive TS1 Ordered Sets with the Hot Reset bit
> asserted, all configured Downstream Ports must
> transition to Hot Reset as soon as possible.
> * All Lanes in the configured Link transmit TS1 Ordered
> Sets with the Hot Reset bit asserted and the configured
> Link and Lane numbers.
>
> Also included is the footnote:
>
> Note: Generally, Lanes of a Downstream or optional crosslink
> Port will be directed to Hot Reset, and Lanes of an Upstream or
> optional crosslink Port will enter Hot Reset by receiving two
> consecutive TS1 Ordered Sets with the Hot Reset bit asserted on
> any configured Lanes...
>
> So I think that means that if a hot reset is received by an upstream
> port, it's propagated to the downstream ports and on to the downstream
> links. The latter point is where there's potential for interpretation.
>
> Going back to sections 6.6.1, we can generate a hot reset via:
>
> * For any Root or Switch Downstream Port, setting the Secondary
> Bus Reset bit of the Bridge Control register associated with the
> Port must cause a hot reset to be sent.
> * For a Switch, the following must cause a hot reset to be sent on
> all Downstream Ports:
> * Setting the Secondary Bus Reset bit of the Bridge
> Control register associated with the Upstream Port
> * Receiving a hot reset on the Upstream Port
>
> Sounds like it propagates to me. But, re-reading this patch, I think
> the goal is to attempt a bus reset on the most downstream
> root/downstream port, but it's pretty confusing.

Yes, I also think a hot reset is propagated to all downstream link. This
patch does bus reset only on root port/downstream port whose children
are endpoints since I need to skip display class devices to avoid
monitor blacks out.

Thanks,
Takao Indoh

>
> I also have a need to add a bus reset interface, in my case though the
> end goal is specifically to reset display class devices to get them into
> a usable state. I just want to provide the kernel interfaces though,
> it's up to the drivers how to use them. Thanks,
>
> Alex
>
>>> through PCIe-to-PCI bridges, so legacy PCI devices may be reset and the
>>> option name is misleading.
>>>
>> In earlier reply, I stated that the functions ought to be renamed to
>> reflect their intentions: endpoint reset.
>> It is not a reset-all-pci-devices interface, as it implies
>>
>>> Even so, you still have root complex endpoints, which are not getting
>>> reset through this, so it's really not a complete solution. Thanks,
>>>
>>> Alex
>>>
>>>> +
>>>> + msleep(1000);
>>>> +
>>>> + for_each_pci_dev(dev)
>>>> + restore_config(dev);
>>>> +
>>>> + return 0;
>>>> +}
>>>> +fs_initcall_sync(reset_pcie_devices);
>>>> +
>>>> static int __init pci_setup(char *str)
>>>> {
>>>> while (str) {
>>>> @@ -3920,6 +4021,8 @@ static int __init pci_setup(char *str)
>>>> pcie_bus_config = PCIE_BUS_PEER2PEER;
>>>> } else if (!strncmp(str, "pcie_scan_all", 13)) {
>>>> pci_add_flags(PCI_SCAN_ALL_PCIE_DEVS);
>>>> + } else if (!strncmp(str, "pcie_reset_devices", 18)) {
>>>> + pcie_reset_devices = 1;
>>>> } else {
>>>> printk(KERN_ERR "PCI: Unknown option `%s'\n",
>>>> str);
>>>
>>>
>>>
>>> _______________________________________________
>>> iommu mailing list
>>> iommu@xxxxxxxxxxxxxxxxxxxxxxxxxx
>>> https://lists.linuxfoundation.org/mailman/listinfo/iommu
>>
>
>
>
>
>

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/