Re: [PATCH] usb: xhci: Fix incomplete PM resume operation due to XHCI commmand timeout

From: Mathias Nyman
Date: Thu Mar 31 2016 - 10:30:31 EST


On 31.03.2016 06:51, Rajesh Bhagat wrote:



Hello Mathias,

Thanks a lot for your support.

Attached patch is working and allows the completion handler to be called. And resume operation completes
and shell comes back (but with lot of errors).

Now, some other points which needs our attention here are:
1. usb core hub code is not checking the error status of hcd->driver->reset_device(xhci_discover_or_reset_device)
and continues considering reset_device was successful (and causes other issues). Hence, a check is needed on return
value of reset_device and proper action is required on it.
2. Even if above return value is checked and reset_device status is used. USB core hub retries for N times which is not
required in this case as adding to the resume operation time. But if in this case we return -ENOTCONN instead of -EINVAL
from hcd->driver->reset_device(xhci_discover_or_reset_device), code returns early and doesn't retry.

Proposed Solution for above with your suggested patches is as below:

diff --git a/drivers/usb/core/hub.c b/drivers/usb/core/hub.c
index 7cad1fa..efeba31 100644
--- a/drivers/usb/core/hub.c
+++ b/drivers/usb/core/hub.c
@@ -2807,13 +2807,17 @@ done:
struct usb_hcd *hcd = bus_to_hcd(udev->bus);

update_devnum(udev, 0);
- /* The xHC may think the device is already reset,
- * so ignore the status.
+ /*
+ * check the status of resetting the device and update
+ * the udev status.
*/
if (hcd->driver->reset_device)
- hcd->driver->reset_device(hcd, udev);
+ status = hcd->driver->reset_device(hcd, udev);

- usb_set_device_state(udev, USB_STATE_DEFAULT);
+ if (status == 0)
+ usb_set_device_state(udev, USB_STATE_DEFAULT);
+ else
+ usb_set_device_state(udev, USB_STATE_NOTATTACHED);
}
} else {
if (udev)
diff --git a/drivers/usb/host/xhci-ring.c b/drivers/usb/host/xhci-ring.c
index b3a0a22..9427175f 100644
--- a/drivers/usb/host/xhci-ring.c
+++ b/drivers/usb/host/xhci-ring.c
@@ -310,6 +310,10 @@ static int xhci_abort_cmd_ring(struct xhci_hcd *xhci)
return -ESHUTDOWN;
}

+ /* writing the CMD_RING_ABORT bit should create a command completion
+ * event, add a command completion timeout for it as well
+ */
+ mod_timer(&xhci->cmd_timer, jiffies + XHCI_CMD_DEFAULT_TIMEOUT);
return 0;
}

@@ -1243,6 +1247,7 @@ void xhci_handle_command_timeout(unsigned long data)
int ret;
unsigned long flags;
u64 hw_ring_state;
+ u32 old_status;
struct xhci_command *cur_cmd = NULL;
xhci = (struct xhci_hcd *) data;

@@ -1250,6 +1255,7 @@ void xhci_handle_command_timeout(unsigned long data)
spin_lock_irqsave(&xhci->lock, flags);
if (xhci->current_cmd) {
cur_cmd = xhci->current_cmd;
+ old_status = cur_cmd->status;
cur_cmd->status = COMP_CMD_ABORT;
}

@@ -1272,7 +1278,15 @@ void xhci_handle_command_timeout(unsigned long data)
}
/* command timeout on stopped ring, ring can't be aborted */
xhci_dbg(xhci, "Command timeout on stopped ring\n");
- xhci_handle_stopped_cmd_ring(xhci, xhci->current_cmd);
+
+ /* is this the second timeout for the same command? ring wont restart */
+ if (old_status == COMP_CMD_ABORT) {
+ xhci_err(xhci, "Abort timeout, Fail to restart cmd ring\n");
+ xhci_cleanup_command_queue(xhci);
+ /* everything else here to halt, cleanup, free etc kill xhc */
+ } else
+ xhci_handle_stopped_cmd_ring(xhci, xhci->current_cmd);
+
spin_unlock_irqrestore(&xhci->lock, flags);
return;
}
diff --git a/drivers/usb/host/xhci.c b/drivers/usb/host/xhci.c
index c502c22..bd18966 100644
--- a/drivers/usb/host/xhci.c
+++ b/drivers/usb/host/xhci.c
@@ -3450,7 +3450,7 @@ int xhci_discover_or_reset_device(struct usb_hcd *hcd, struct usb_device *udev)
if (ret == 1)
return 0;
else
- return -EINVAL;
+ return -ENOTCONN; /* Don't retry */
}

if (virt_dev->tt_info)

Sample output after above patch (timer is set as wakeup source):

root@phoenix:~# echo mem > /sys/power/state
PM: Syncing filesystems ... done.
Freezing user space processes ... (elapsed 0.001 seconds) done.
Freezing remaining freezable tasks ... (elapsed 0.001 seconds) done.
PM: suspend of devices complete after 155.658 msecs
PM: late suspend of devices complete after 1.594 msecs
PM: noirq suspend of devices complete after 1.496 msecs
PM: noirq resume of devices complete after 1.290 msecs
PM: early resume of devices complete after 1.250 msecs
usb usb1: root hub lost power or was reset
usb usb2: root hub lost power or was reset
xhci-hcd xhci-hcd.0.auto: Abort timeout, Fail to restart cmd ring
xhci-hcd xhci-hcd.0.auto: Error while assigning device slot ID
xhci-hcd xhci-hcd.0.auto: Max number of devices this xHCI host supports is 127.
xhci-hcd xhci-hcd.0.auto: Abort timeout, Fail to restart cmd ring
xhci-hcd xhci-hcd.0.auto: Error while assigning device slot ID
xhci-hcd xhci-hcd.0.auto: Max number of devices this xHCI host supports is 127.
PM: resume of devices complete after 25436.366 msecs <= resume time is increased a lot even after using -ENOTCONN
Restarting tasks ...
usb 1-1: USB disconnect, device number 2
usb 1-1.2: USB disconnect, device number 3
usb 2-1: USB disconnect, device number 2
done.
root@phoenix:~#

Please share your opinion on other changes for patch submission as well as resume time.


I think more effort should be put into investigating why this happens in the first place.
What is the root cause? why doesn't xhci start properly after resume in this case.

Optimizing resume time and error paths should be secondary here, resuming faster isn't really
a matter when xhci is completely stuck.

-Mathias