Re: [PATCH 3.12 033/118] usb: xhci: Link TRB must not occur within a USB payload burst

From: Sarah Sharp
Date: Mon Jan 06 2014 - 19:32:05 EST


On Fri, Jan 03, 2014 at 03:29:29PM -0800, Sarah Sharp wrote:
> On Fri, Jan 03, 2014 at 01:21:18PM -0800, walt wrote:
> > I'm so sorry Sarah, that was another mistake. The mistake is so stupid I'm not
> > going to publish it here :(
> >
> > Once I finally ran the kernel with debugging actually compiled in, dmesg contains
> > xhci debugging messages. Wow :)
> >
> > It's a big file so I zipped and attached it, which I hope is acceptable in lkml.
>
> Yep, that's fine. Sticking it in pastebin (or up on your server) is
> also fine, if it gets really big.
>
> > BTW, this dmesg is from a kernel with sg_tablesize = 31, which as I said before
> > doesn't fix the problem. The cp stopped around 7GB just as before.
> >
> > Sorry for the noise...
>
> No worries! :) With the dmesg, I can finally see what happened:
>
> [ 188.703059] xhci_hcd 0000:03:00.0: Cancel URB ffff8800b7d2e0c0, dev 1, ep 0x2, starting at offset 0xbb7b9000
> [ 188.703072] xhci_hcd 0000:03:00.0: // Ding dong!
> [ 193.711022] xhci_hcd 0000:03:00.0: xHCI host not responding to stop endpoint command.
> [ 193.711029] xhci_hcd 0000:03:00.0: Assuming host is dying, halting host.
> [ 193.711046] xhci_hcd 0000:03:00.0: // Halt the HC
> [ 193.711060] xhci_hcd 0000:03:00.0: Killing URBs for slot ID 1, ep index 0
> [ 193.711066] xhci_hcd 0000:03:00.0: Killing URBs for slot ID 1, ep index 2
> [ 193.711078] xhci_hcd 0000:03:00.0: Killing URBs for slot ID 1, ep index 3
> [ 193.711096] xhci_hcd 0000:03:00.0: Calling usb_hc_died()
> [ 193.711103] xhci_hcd 0000:03:00.0: HC died; cleaning up
> [ 193.711116] xhci_hcd 0000:03:00.0: xHCI host controller is dead.
>
> It seems that the xHCI driver tried to stop the endpoint ring in order
> to cancel a SCSI transfer, and the driver never got a response for that.
>
> The offset is rather suspicious (0xbb7b9000), and it probably means the
> driver attempted to cancel a transfer that had been moved to the
> beginning of the ring segment, with no-op TRBs before the link TRB.
>
> I suspect David's patch triggers a bug in the command cancellation code.
> There's also the unlikely possibility that the no-op TRBs did indeed
> cause the host to hang. Either way, I'll have to look into it.
>
> I'll let you know when I have some diagnostic patches ready.

Hi Walt,

I have three patches for you to test. You can either apply the
attached patches, or you can pull down a kernel with:

git clone git://git.kernel.org/pub/scm/linux/kernel/git/sarah/xhci.git -b 3.12-td-fragment-failure

Please apply only the first patch (which is diagnostic only) to start
with, trigger your issue, and send me the resulting dmesg. Then apply
the other two patches and see if the issue goes away. (I suspect it
won't, but I can't be sure.)
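
Before sending the dmesg, it may be worth checking that the xhci debug
messages actually made it into the log. A minimal sketch, assuming the
log was saved to a file; the helper name and file names below are
placeholders, not anything that ships with the patches:

```shell
# Hypothetical helper: print only the xhci_hcd lines from a saved dmesg log.
# Usage: xhci_log_filter <saved-dmesg-file>
xhci_log_filter() {
    grep 'xhci_hcd' "$1"
}

# Example (file names are placeholders):
#   dmesg > dmesg-diagnostic.txt
#   xhci_log_filter dmesg-diagnostic.txt | tail
```

If that prints nothing, the kernel was probably built without xhci
debugging enabled. Please still send the complete dmesg rather than
the filtered output.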

Sarah Sharp