RE: [PATCH] btusb: fix zero BD address problem during stress test

From: Amitkumar Karwar
Date: Wed Nov 23 2016 - 04:44:33 EST


Hi Marcel,

> From: Marcel Holtmann [mailto:marcel@xxxxxxxxxxxx]
> Sent: Wednesday, November 23, 2016 1:46 PM
> To: Amitkumar Karwar
> Cc: linux-bluetooth@xxxxxxxxxxxxxxx; linux-kernel@xxxxxxxxxxxxxxx;
> Cathy Luo; Nishant Sarmukadam; Ganapathi Bhat
> Subject: Re: [PATCH] btusb: fix zero BD address problem during stress
> test
>
> Hi Amitkumar,
>
> >>>> From: Amitkumar Karwar [mailto:akarwar@xxxxxxxxxxx]
> >>>> Sent: Tuesday, October 18, 2016 6:27 PM
> >>>> To: linux-bluetooth@xxxxxxxxxxxxxxx
> >>>> Cc: marcel@xxxxxxxxxxxx; linux-kernel@xxxxxxxxxxxxxxx; Cathy Luo;
> >>>> Nishant Sarmukadam; Ganapathi Bhat; Amitkumar Karwar
> >>>> Subject: [PATCH] btusb: fix zero BD address problem during stress
> >>>> test
> >>>>
> >>>> From: Ganapathi Bhat <gbhat@xxxxxxxxxxx>
> >>>>
> >>>> We came across a corner case issue during reboot stress test in
> >> which
> >>>> hciconfig shows BD address is all zero. Reason is we don't get
> >>>> response for HCI RESET command during initialization
> >>>>
> >>>> The issue is tracked to a race where USB subsystem calls
> >>>> btusb_intr_complete() to deliver a data(NOOP frame) received on
> >>>> interrupt endpoint. HCI_RUNNING flag is not yet set by bluetooth
> >>>> subsystem. So we ignore that frame and return.
> >>>>
> >>>> As we missed to resubmit the buffer to interrupt endpoint in this
> >>>> case, we don't get response for BT reset command downloaded after
> >> this.
> >>>>
> >>>> This patch handles the corner case to resolve zero BD address
> >> problem.
> >>>>
> >>>> Signed-off-by: Ganapathi Bhat <gbhat@xxxxxxxxxxx>
> >>>> Signed-off-by: Amitkumar Karwar <akarwar@xxxxxxxxxxx>
> >>>> ---
> >>>> drivers/bluetooth/btusb.c | 5 +----
> >>>> 1 file changed, 1 insertion(+), 4 deletions(-)
> >>>>
> >>>> diff --git a/drivers/bluetooth/btusb.c b/drivers/bluetooth/btusb.c
> >>>> index 811f9b9..b5596ac 100644
> >>>> --- a/drivers/bluetooth/btusb.c
> >>>> +++ b/drivers/bluetooth/btusb.c
> >>>> @@ -607,10 +607,7 @@ static void btusb_intr_complete(struct urb
> >> *urb)
> >>>> BT_DBG("%s urb %p status %d count %d", hdev->name, urb, urb-
> >>>>> status,
> >>>> urb->actual_length);
> >>>>
> >>>> - if (!test_bit(HCI_RUNNING, &hdev->flags))
> >>>> - return;
> >>>> -
> >>>> - if (urb->status == 0) {
> >>>> + if (urb->status == 0 && test_bit(HCI_RUNNING, &hdev-
> >flags)) {
> >>>> hdev->stat.byte_rx += urb->actual_length;
> >>>>
> >>>> if (btusb_recv_intr(data, urb->transfer_buffer,
> >>>
> >>> Did you get a chance to check this?
> >>> Please let us know if you have any review comments.
> >>
> >> can you explain how this is correct and show me the HCI traces for
> >> this.
> >>
> >
> > I suppose HCI trace means hcidump logs here. As device hasn't yet
> initialized, hcidump won't show anything.
> > We had added debug info in btusb driver to trace the data received on
> all USB endpoints and also checked usbmon logs.
>
> use btmon and it will show it.
>
> > Here is the sequence of events we observed in a corner case while
> running stress test.
> > 1) Inside btusb_open() call ------ Thread 1
> > 2) btusb_submit_intr_urb() submits the URB for receiving data on
> > interrupt endpoint ---- Thread 1
> > 3) btusb_intr_complete() gets called to deliver NOP frame from HCI
> > controller ---- Thread 2
> > 4) HCI_RUNNING isn't set yet. So we return from btusb_intr_complete()
> > without resubmitting the buffer --- Thread 2
> > 5) Exit btusb_open() ---- Thread 1
> > 6) "set_bit(HCI_RUNNING, &hdev->flags)" done by bluetooth core ----
> > Thread 1
> >
> > Later HCI_RESET command gets timedout, as we haven't re-submitted
> buffer for interrupt endpoint in step (4) above.
> >
> > Please find attached logs. usbmon log shows first frame received on
> interrupt endpoint is NOP(Search for Marvell in log).
> >
> > Here is what bluetooth spec says about NOP frame.
> >
> > "To indicate to the Host that the Controller is ready to receive HCI
> > command packets, the Controller generates a Command Status event with
> > Status 0x00 and Command_Opcode 0x0000, and the
> Num_HCI_Command_Packets
> > event parameter is set to 1 or more. Command_Opcode, 0x0000 is a NOP
> (No Operation)â
>
> So I wonder if we need to remove the HCI_RUNNING logic from the
> drivers. It is only left in a few USB drivers and I removed the rest
> and moved it into the core. I am not in favour of papering over this
> issue. I need to understand what is wrong. And actually HCI_RUNNING
> needs to be taken away from the drivers. So the question is if btusb.c
> still needs it or if that is just a leftover. Meaning is it protecting
> anything?
>

I can see HCI_RUNNING check being used at below places in btusb.c

1) btusb_intr_complete
2) btusb_bulk_complete
3) btusb_isoc_complete
4) btusb_tx_complete
5) btusb_isoc_tx_complete
6) btusb_bcm_set_diag
7) btusb_resume

I think, it can be removed from (1), (2) and (3), as we already have below check inside hci_recv_frame()

if (!hdev || (!test_bit(HCI_UP, &hdev->flags)
&& !test_bit(HCI_INIT, &hdev->flags))) {
kfree_skb(skb);
return -ENXIO;
}

For (4) and (5), it's just used to decide if we should update "hdev->stat" or not.

Regards,
Amitkumar