Re: [PATCH 2/2] PCI,pciehp: avoid add a device already exist duringpciehp_resume

From: Yijing Wang
Date: Tue Jul 09 2013 - 23:02:38 EST


Hi Bjorn,
Thanks for your review and comments!

>> We can use PCIe Device Serial Number to identify the device if
>> device support DSN.
>
> I think I like the idea of this, especially because the Microsoft PCI
> Hardware Compliance Test apparently requires DSN for hot-pluggable
> PCIe devices [1], so it should be pretty universal.
>
> [1] http://www.techtalkz.com/microsoft-device-drivers-dtm/341362-dtm-pcihct-test-violates-pci-express-base-specification-revision-1-a.html
>
>> currently:
>> 1. slot is empty before suspend, insert card during suspend.
>> In this case, is safe, pciehp will add device by check adapter
>> status register in pciehp_resume.
>
> Your patch doesn't change anything here.

Yes, I only to make some changes for case 3/4.

>
>> 2. slot is non empty before suspend, remove card during suspend.
>> Also be safe, pciehp will remove device by check adapter
>> status register in pciehp_resume.
>
> Your patch doesn't change anything here. (But I think the driver
> .remove() method will try to poke at the non-existent device; see
> below.)

I'm not sure the result of driver .remove() method to poke at the non-existent device.
If driver .remove() method cannot detect the real device, remove action will be block ?
If the slot support surprise hot remove, this action maybe safe. right?

If the slot does not support surprise hot remove, but the device was already removed,
we seem to have no other way to clean the stale data related to the old device.

Now if we check adapter status in slot and found adapter is non existent, pciehp resume
call pciehp_disable_slot() , in pciehp_disable_slot() function, we will check latch status,
I guess this case latch is open(because slot is empty), this action will abort.
But I have no platform to test it.

>
>> 3. slot is non empty before suspend, remove card during suspend
>> and insert a new card.
>> Now pciehp just call pciehp_enable_slot() roughly. We should
>> remove the old card firstly, then add the new card.
>
> With your patch, I think we'll call the old driver's .remove() method
> on the new device. This seems bad; see below.

Ah, this is issue.
What about power off slot first, then call the old driver's remove() method
will not touch the new physical device. After the old driver's remove() cleanup,
we call pciehp_enable_slot() to power on and enable the new device.

>
> With your patch, if we remove and reinsert the same device while
> suspended, we do nothing because the DSN didn't change. Previously we
> called pciehp_enable_slot(). I don't know if we need to do anything
> here or not.

Mainly to avoid the redundant device add, the same like the changes for case 4

>
>> 4. slot is non empty before suspend, no action during suspend.
>> We should do nothing in pciehp_resume, but we call
>> pciehp_enable_slot(), so some uncomfortable messages show like above.
>> In this case, we can improve it a little by add a guard
>> if (!list_empty(bus->devices)).
>
> This is the common case. Previously we called pciehp_enable_slot(),
> and with your patch we do nothing. I think that seems sensible, but
> this part should be split into a separate patch. That way we can keep
> the benefit of this change even if we trip over something with the
> other changes.

OK, I will split this changes into a new patch.

>
>> Reported-by: Paul Bolle <pebolle@xxxxxxxxxx>
>> Signed-off-by: Yijing Wang <wangyijing@xxxxxxxxxx>
>> Cc: Paul Bolle <pebolle@xxxxxxxxxx>
>> Cc: "Rafael J. Wysocki" <rjw@xxxxxxx>
>> Cc: Oliver Neukum <oneukum@xxxxxxx>
>> Cc: Gu Zheng <guz.fnst@xxxxxxxxxxxxxx>
>> Cc: linux-pci@xxxxxxxxxxxxxxx
>> ---
>> drivers/pci/hotplug/pciehp_core.c | 38 +++++++++++++++++++++++++++++++++---
>> 1 files changed, 34 insertions(+), 4 deletions(-)
>>
>> diff --git a/drivers/pci/hotplug/pciehp_core.c b/drivers/pci/hotplug/pciehp_core.c
>> index 7d72c5e..d01e093 100644
>> --- a/drivers/pci/hotplug/pciehp_core.c
>> +++ b/drivers/pci/hotplug/pciehp_core.c
>> @@ -291,6 +291,28 @@ static void pciehp_remove(struct pcie_device *dev)
>> }
>>
>> #ifdef CONFIG_PM
>> +
>> +/* If device support Device Serial Numner, use DSN
>
> s/support/supports/
> s/Numner/Number/
> Use conventional comment style:
> /*
> * If device ...
> */
>

Will update, thanks.


>> + * to identify the device
>> + */
>> +static bool device_in_slot_is_changed(struct pci_bus *pbus)
>> +{
>> + u64 old_dsn, new_dsn;
>> + struct pci_dev *pdev;
>> +
>> + pdev = pci_get_slot(pbus, PCI_DEVFN(0, 0));
>
> pci_get_slot() can fail.

Will add failure return check, thanks.

>
>> + old_dsn = pdev->sn;
>> +
>> + /* get func 0 device serial number */
>> + pci_get_dsn(pdev, &new_dsn);
>> + if (status) {
>> + if (list_empty(&pbus->devices))
>> + pciehp_enable_slot(slot);
>> + else if (device_in_slot_is_changed(pbus)) {
>> + pciehp_disable_slot(slot);
>
> pciehp_disable_slot() ultimately calls the .remove() method for the
> device that has already been removed. That method is likely to poke
> at the device, which will do something unexpected because the new
> device is likely to be completely different.

Yes, this is a really issue, so what about firstly power off the slot before call
.remove() method for the old device.

>
> I think the call path is this:
>
> pciehp_resume
> pciehp_disable_slot
> remove_board
> pciehp_unconfigure_device
> pci_stop_and_remove_bus_device
> pci_stop_bus_device
> pci_stop_dev
> device_del
> bus_remove_device
> device_release_driver
> __device_release_driver
> pci_device_remove # pci_bus_type.remove
> dev->driver->remove
>
> I think we already had this problem for case 2 (card removed while
> suspended), but at least in that case, the driver .remove() method
> doesn't have a device to poke at, so it's less likely to do something
> bad.

Agree.

>
>> + pciehp_enable_slot(slot);
>> + }
>> + } else {
>> + if (!list_empty(&pbus->devices))
>> + pciehp_disable_slot(slot);
>> + }
>> return 0;
>> }
>> #endif /* PM */
>> --
>> 1.7.1
>>
>>
>
> .
>


--
Thanks!
Yijing

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/