Re: [PATCH] xhci-mtk: Fix NULL pointer dereference with xhci_irq() for shared_hcd

From: Greg Kroah-Hartman
Date: Thu Mar 05 2020 - 13:32:11 EST


On Thu, Mar 05, 2020 at 10:58:46AM +0800, Macpaul Lin wrote:
> On Wed, 2020-03-04 at 16:39 +0200, Mathias Nyman wrote:
> > On 4.3.2020 5.16, Macpaul Lin wrote:
> > > On Tue, 2020-02-04 at 17:44 +0800, Mathias Nyman wrote:
> > >> On 1.2.2020 13.20, Macpaul Lin wrote:
> > >>> On Fri, 2020-01-31 at 16:50 +0200, Mathias Nyman wrote:
> > >>>> On 17.1.2020 9.41, Macpaul Lin wrote:
> > >>>>> According to NULL pointer fix: https://tinyurl.com/uqft5ra
> > >>>>> xhci: Fix NULL pointer dereference with xhci_irq() for shared_hcd
> > >>>>> A similar issue has also been found during QC activities at MediaTek.
> > >>>>>
> > >>>>> The description from the referenced patch is quoted below.
> > >>>>> "Commit ("f068090426ea xhci: Fix leaking USB3 shared_hcd
> > >>>>> at xhci removal") sets xhci_shared_hcd to NULL without
> > >>>>> stopping xhci host. This results into a race condition
> > >>>>> where shared_hcd (super speed roothub) related interrupts
> > >>>>> are being handled with xhci_irq happens when the
> > >>>>> xhci_plat_remove is called and shared_hcd is set to NULL.
> > >>>>> Fix this by setting the shared_hcd to NULL only after the
> > >>>>> controller is halted and no interrupts are generated."
> > >>>>>
> > >>>>> Signed-off-by: Sriharsha Allenki <sallenki@xxxxxxxxxxxxxx>
> > >>>>> Signed-off-by: Macpaul Lin <macpaul.lin@xxxxxxxxxxxx>
> > >>>>> ---
> > >>>>> drivers/usb/host/xhci-mtk.c | 2 +-
> > >>>>> 1 file changed, 1 insertion(+), 1 deletion(-)
> > >>>>>
> > >>>>> diff --git a/drivers/usb/host/xhci-mtk.c b/drivers/usb/host/xhci-mtk.c
> > >>>>> index b18a6baef204..c227c67f5dc5 100644
> > >>>>> --- a/drivers/usb/host/xhci-mtk.c
> > >>>>> +++ b/drivers/usb/host/xhci-mtk.c
> > >>>>> @@ -593,11 +593,11 @@ static int xhci_mtk_remove(struct platform_device *dev)
> > >>>>>  	struct usb_hcd *shared_hcd = xhci->shared_hcd;
> > >>>>>
> > >>>>>  	usb_remove_hcd(shared_hcd);
> > >>>>> -	xhci->shared_hcd = NULL;
> > >>>>>  	device_init_wakeup(&dev->dev, false);
> > >>>>>
> > >>>>>  	usb_remove_hcd(hcd);
> > >>>>>  	usb_put_hcd(shared_hcd);
> > >>>>> +	xhci->shared_hcd = NULL;
> > >>>>>  	usb_put_hcd(hcd);
> > >>>>>  	xhci_mtk_sch_exit(mtk);
> > >>>>>  	xhci_mtk_clks_disable(mtk);
> > >>>>>
> > >>>>
> > >>>> Could you share details of the NULL pointer dereference (backtrace)?
> > >>>
> > >>> This bug was found by our QA staff while plugging devices in and out
> > >>> 500 times. The backtrace I have was recorded by QA, and I have not
> > >>> reproduced the issue in my own environment. However, after applying this
> > >>> patch the issue seems to be resolved. Here is the backtrace:
> > >>>
> > >>> Exception Class: Kernel (KE)
> > >>> PC is at [<ffffff8008cccbc0>] xhci_irq+0x728/0x2364
> > >>> LR is at [<ffffff8008ccc788>] xhci_irq+0x2f0/0x2364
> > >>>
> > >>> Current Executing Process:
> > >>> [iptables, 859][netdagent, 770]
> > >>>
> > >>> Backtrace:
> > >>> [<ffffff80080ead58>] __atomic_notifier_call_chain+0xa8/0x130
> > >>> [<ffffff80080eb6d4>] notify_die+0x84/0xac
> > >>> [<ffffff800808e874>] die+0x1d8/0x3b8
> > >>> [<ffffff80080a89b0>] __do_kernel_fault+0x178/0x188
> > >>> [<ffffff80080a81b4>] do_page_fault+0x44/0x3b0
> > >>> [<ffffff80080a811c>] do_translation_fault+0x44/0x98
> > >>> [<ffffff8008080e08>] do_mem_abort+0x4c/0x128
> > >>> [<ffffff80080832d0>] el1_da+0x24/0x3c
> > >>> [<ffffff8008cccbc0>] xhci_irq+0x728/0x2364
> > >>> [<ffffff8008c98804>] usb_hcd_irq+0x2c/0x44
> > >>> [<ffffff8008179bb0>] __handle_irq_event_percpu+0x26c/0x4a4
> > >>> [<ffffff8008179ec8>] handle_irq_event+0x5c/0xd0
> > >>> [<ffffff800817e3c0>] handle_fasteoi_irq+0x10c/0x1e0
> > >>> [<ffffff80081787b0>] __handle_domain_irq+0x32c/0x738
> > >>> [<ffffff800808159c>] gic_handle_irq+0x174/0x1c4
> > >>> [<ffffff8008083cf8>] el0_irq_naked+0x50/0x5c
> > >>> [<ffffffffffffffff>] 0xffffffffffffffff
> > >>>
> > >>
> > >> Thanks,
> > >> Could you help me find out which line of code xhci_irq+0x728 is in your case?
> > >>
> > >> As Guenter pointed out there is a risk of turning the NULL pointer dereference
> > >> into a use after free if we just solve this by setting xhci->shared_hcd = NULL
> > >> later.
> > >>
> > >> If you still have that kernel around, and xhci is compiled in:
> > >> gdb vmlinux
> > >> (gdb) li *(xhci_irq+0x728)
> > >>
> > >
> > > Sorry that I couldn't get back to you sooner. The internal code version
> > > where this issue occurred is really old, and it was a little difficult
> > > to rewind to that version.
> > > However, I think the following dump should be correct for that code base.
> > >
> > > (gdb) li *(xhci_irq+0x728)
> > > 0xffffff8008cc8634 is in xhci_irq (*stripped*
> > > kernel-4.14/drivers/usb/host/xhci.h:1694).
> > > 1689 */
> > > 1690 #define XHCI_MAX_REXIT_TIMEOUT_MS 20
> > > 1691
> > > 1692 static inline unsigned int hcd_index(struct usb_hcd *hcd)
> > > 1693 {
> > > 1694 	if (hcd->speed >= HCD_USB3)
> > > 1695 		return 0;
> > > 1696 	else
> > > 1697 		return 1;
> > > 1698 }
> > > (gdb)
> > >
> > > Thanks
> > > Macpaul Lin
> > >
> >
> > Ah, it was a 4.14 kernel.
> > This should be fixed in 4.20 with patch:
> > 1245374e9b83 xhci: handle port status events for removed USB3 hcd
> >
> > Port arrays/structures were changed completely in 4.18
> >
> > Something like the below should work for 4.14:
> >
> > diff --git a/drivers/usb/host/xhci-ring.c b/drivers/usb/host/xhci-ring.c
> > index 61fa3007a74a..e7367b9f19c5 100644
> > --- a/drivers/usb/host/xhci-ring.c
> > +++ b/drivers/usb/host/xhci-ring.c
> > @@ -1640,6 +1640,12 @@ static void handle_port_status(struct xhci_hcd *xhci,
> >  	if ((major_revision == 0x03) != (hcd->speed >= HCD_USB3))
> >  		hcd = xhci->shared_hcd;
> >
> > +	if (!hcd) {
> > +		xhci_dbg(xhci, "No hcd found for port %u event\n", port_id);
> > +		bogus_port_status = true;
> > +		goto cleanup;
> > +	}
> > +
> >  	if (major_revision == 0) {
> >  		xhci_warn(xhci, "Event for port %u not in "
> >  				"Extended Capabilities, ignoring.\n",
>
> Thanks for this suggestion, this is much better! I am sorry that we are
> using an Android kernel, so some of the issues we report might be out of
> date. I will apply the suggestion to our code base. Thanks!

Should I backport this to 4.14 and older kernels to prevent this issue
from showing up in newer Android devices that are using these older
kernels?

thanks,

greg k-h
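
For reference, the guard suggested above for 4.14 can be modelled in user
space roughly as follows. This is only a sketch under stated assumptions:
the model_* types, the MODEL_HCD_* constants and model_handle_port_status()
are hypothetical stand-ins for illustration, not the kernel's definitions,
and the real handle_port_status() does much more than this.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical stand-ins for the kernel's speed constants. */
#define MODEL_HCD_USB2 0x20
#define MODEL_HCD_USB3 0x40

/* Hypothetical, heavily reduced model of struct usb_hcd. */
struct model_hcd {
	int speed;
};

/* Hypothetical, heavily reduced model of struct xhci_hcd. */
struct model_xhci {
	struct model_hcd *main_hcd;   /* USB2 roothub */
	struct model_hcd *shared_hcd; /* USB3 roothub, NULL once removed */
};

/*
 * Pick the roothub for a port event the same way the quoted 4.14 hunk does,
 * then bail out if that roothub is already gone instead of dereferencing it.
 * Returns true if the event was dispatched, false if it was ignored.
 */
static bool model_handle_port_status(struct model_xhci *xhci,
				     unsigned int port_id,
				     unsigned int major_revision)
{
	struct model_hcd *hcd;

	assert(xhci != NULL && xhci->main_hcd != NULL);
	hcd = xhci->main_hcd;

	/* A USB3 port event belongs to the shared (USB3) roothub. */
	if ((major_revision == 0x03) != (hcd->speed >= MODEL_HCD_USB3))
		hcd = xhci->shared_hcd;

	/* The guard: shared_hcd may already have been set to NULL. */
	if (!hcd) {
		printf("No hcd found for port %u event\n", port_id);
		return false;
	}

	/* From here on it is safe to read hcd->speed. */
	(void)hcd->speed;
	return true;
}
```

With shared_hcd set to NULL, a USB3 port event (major_revision == 0x03) is
ignored rather than dereferenced, while USB2 port events are still handled
through main_hcd.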