Re: [PATCH] xhci-mtk: Fix NULL pointer dereference with xhci_irq() for shared_hcd

From: Macpaul Lin
Date: Thu Mar 05 2020 - 21:36:25 EST


On Thu, 2020-03-05 at 19:32 +0100, Greg Kroah-Hartman wrote:
> On Thu, Mar 05, 2020 at 10:58:46AM +0800, Macpaul Lin wrote:
> > On Wed, 2020-03-04 at 16:39 +0200, Mathias Nyman wrote:
> > > On 4.3.2020 5.16, Macpaul Lin wrote:
> > > > On Tue, 2020-02-04 at 17:44 +0800, Mathias Nyman wrote:
> > > >> On 1.2.2020 13.20, Macpaul Lin wrote:
> > > >>> On Fri, 2020-01-31 at 16:50 +0200, Mathias Nyman wrote:
> > > >>>> On 17.1.2020 9.41, Macpaul Lin wrote:
> > > >>>>> According to NULL pointer fix: https://tinyurl.com/uqft5ra
> > > >>>>> xhci: Fix NULL pointer dereference with xhci_irq() for shared_hcd
> > > >>>>> The similar issue has also been found in QC activities in Mediatek.
> > > >>>>>
> > > >>>>> Here quote the description from the referenced patch as follows.
> > > >>>>> "Commit ("f068090426ea xhci: Fix leaking USB3 shared_hcd
> > > >>>>> at xhci removal") sets xhci_shared_hcd to NULL without
> > > >>>>> stopping xhci host. This results into a race condition
> > > >>>>> where shared_hcd (super speed roothub) related interrupts
> > > >>>>> are being handled with xhci_irq happens when the
> > > >>>>> xhci_plat_remove is called and shared_hcd is set to NULL.
> > > >>>>> Fix this by setting the shared_hcd to NULL only after the
> > > >>>>> controller is halted and no interrupts are generated."
> > > >>>>>
> > > >>>>> Signed-off-by: Sriharsha Allenki <sallenki@xxxxxxxxxxxxxx>
> > > >>>>> Signed-off-by: Macpaul Lin <macpaul.lin@xxxxxxxxxxxx>
> > > >>>>> ---
> > > >>>>> drivers/usb/host/xhci-mtk.c | 2 +-
> > > >>>>> 1 file changed, 1 insertion(+), 1 deletion(-)
> > > >>>>>
> > > >>>>> diff --git a/drivers/usb/host/xhci-mtk.c b/drivers/usb/host/xhci-mtk.c
> > > >>>>> index b18a6baef204..c227c67f5dc5 100644
> > > >>>>> --- a/drivers/usb/host/xhci-mtk.c
> > > >>>>> +++ b/drivers/usb/host/xhci-mtk.c
> > > >>>>> @@ -593,11 +593,11 @@ static int xhci_mtk_remove(struct platform_device *dev)
> > > >>>>> struct usb_hcd *shared_hcd = xhci->shared_hcd;
> > > >>>>>
> > > >>>>> usb_remove_hcd(shared_hcd);
> > > >>>>> - xhci->shared_hcd = NULL;
> > > >>>>> device_init_wakeup(&dev->dev, false);
> > > >>>>>
> > > >>>>> usb_remove_hcd(hcd);
> > > >>>>> usb_put_hcd(shared_hcd);
> > > >>>>> + xhci->shared_hcd = NULL;
> > > >>>>> usb_put_hcd(hcd);
> > > >>>>> xhci_mtk_sch_exit(mtk);
> > > >>>>> xhci_mtk_clks_disable(mtk);
> > > >>>>>
> > > >>>>
> > > >>>> Could you share details of the NULL pointer dereference, (backtrace).
> > > >>>
> > > >>> This bug was found by our QA staff while doing 500 times plug-in and
> > > >>> plug-out devices. The backtrace I have was recorded by QA and I didn't
> > > >>> reproduce this issue on my own environment. However, after applied this
> > > >>> patch the issue seems resolve. Here is the backtrace:
> > > >>>
> > > >>> Exception Class: Kernel (KE)
> > > >>> PC is at [<ffffff8008cccbc0>] xhci_irq+0x728/0x2364
> > > >>> LR is at [<ffffff8008ccc788>] xhci_irq+0x2f0/0x2364
> > > >>>
> > > >>> Current Executing Process:
> > > >>> [iptables, 859][netdagent, 770]
> > > >>>
> > > >>> Backtrace:
> > > >>> [<ffffff80080ead58>] __atomic_notifier_call_chain+0xa8/0x130
> > > >>> [<ffffff80080eb6d4>] notify_die+0x84/0xac
> > > >>> [<ffffff800808e874>] die+0x1d8/0x3b8
> > > >>> [<ffffff80080a89b0>] __do_kernel_fault+0x178/0x188
> > > >>> [<ffffff80080a81b4>] do_page_fault+0x44/0x3b0
> > > >>> [<ffffff80080a811c>] do_translation_fault+0x44/0x98
> > > >>> [<ffffff8008080e08>] do_mem_abort+0x4c/0x128
> > > >>> [<ffffff80080832d0>] el1_da+0x24/0x3c
> > > >>> [<ffffff8008cccbc0>] xhci_irq+0x728/0x2364
> > > >>> [<ffffff8008c98804>] usb_hcd_irq+0x2c/0x44
> > > >>> [<ffffff8008179bb0>] __handle_irq_event_percpu+0x26c/0x4a4
> > > >>> [<ffffff8008179ec8>] handle_irq_event+0x5c/0xd0
> > > >>> [<ffffff800817e3c0>] handle_fasteoi_irq+0x10c/0x1e0
> > > >>> [<ffffff80081787b0>] __handle_domain_irq+0x32c/0x738
> > > >>> [<ffffff800808159c>] gic_handle_irq+0x174/0x1c4
> > > >>> [<ffffff8008083cf8>] el0_irq_naked+0x50/0x5c
> > > >>> [<ffffffffffffffff>] 0xffffffffffffffff
> > > >>>
> > > >>
> > > >> Thanks,
> > > >> Could you help me find out which line of code xhci_irq+0x728 is in your case.
> > > >>
> > > >> As Guenter pointed out there is a risk of turning the NULL pointer dereference
> > > >> into a use after free if we just solve this by setting xhci->shared_hcd = NULL
> > > >> later.
> > > >>
> > > >> If you still have that kernel around, and xhci is compiled in:
> > > >> gdb vmlinux
> > > >> gdb li *(xhci_irq+0x728)
> > > >>
> > > >
> > > > Sorry that I couldn't get back to you soon. The internal code version
> > > > for this issue was really old and a little bit difficult to rewind to
> > > > that version.
> > > > However, I think the following dump might be correct for the code base.
> > > >
> > > > (gdb) li *(xhci_irq+0x728)
> > > > 0xffffff8008cc8634 is in xhci_irq (*stripped*
> > > > kernel-4.14/drivers/usb/host/xhci.h:1694).
> > > > 1689 */
> > > > 1690 #define XHCI_MAX_REXIT_TIMEOUT_MS 20
> > > > 1691
> > > > 1692 static inline unsigned int hcd_index(struct usb_hcd *hcd)
> > > > 1693 {
> > > > 1694 if (hcd->speed >= HCD_USB3)
> > > > 1695 return 0;
> > > > 1696 else
> > > > 1697 return 1;
> > > > 1698 }
> > > > (gdb)
> > > >
> > > > Thanks
> > > > Macpaul Lin
> > > >
> > >
> > > Ah, it was a 4.14 kernel.
> > > This should be fixed in 4.20 with patch:
> > > 1245374e9b83 xhci: handle port status events for removed USB3 hcd
> > >
> > > Port arrays/structures were changed completely in 4.18
> > >
> > > Something like the below should work for 4.14:
> > >
> > > diff --git a/drivers/usb/host/xhci-ring.c b/drivers/usb/host/xhci-ring.c
> > > index 61fa3007a74a..e7367b9f19c5 100644
> > > --- a/drivers/usb/host/xhci-ring.c
> > > +++ b/drivers/usb/host/xhci-ring.c
> > > @@ -1640,6 +1640,12 @@ static void handle_port_status(struct xhci_hcd *xhci,
> > > if ((major_revision == 0x03) != (hcd->speed >= HCD_USB3))
> > > hcd = xhci->shared_hcd;
> > >
> > > + if (!hcd) {
> > > + xhci_dbg(xhci, "No hcd found for port %u event\n", port_id);
> > > + bogus_port_status = true;
> > > + goto cleanup;
> > > + }
> > > +
> > > if (major_revision == 0) {
> > > xhci_warn(xhci, "Event for port %u not in "
> > > "Extended Capabilities, ignoring.\n",
> >
> > Thanks for this suggestion, this is much better! I am sorry that we're
> > using android kernel that some reported issue might be out of date. I
> > will update the suggestion into our code base. Thanks!
>
> Should I backport this to 4.14 and older kernels to prevent this issue
> from showing up in newer Android devices that are using these older
> kernels?
>
> thanks,
>
> greg k-h

If this could be backported to older kernel that will be great for newer
Android devices. Some of the shipping devices will have requirement of
kernel upgrade. Hence if you could backport this patch will be great.

Thanks!

Regards,
Macpaul Lin