Re: xhci_pci & PCIe hotplug crash

From: Pali Rohár
Date: Wed May 05 2021 - 09:02:46 EST


On Wednesday 05 May 2021 14:44:02 Lukas Wunner wrote:
> On Wed, May 05, 2021 at 02:33:46PM +0200, Pali Rohár wrote:
> > I just spotted this crash while debugging the PCIe controller driver
> > pci-aardvark.c, when trying to expose its link down events via the
> > "hot plug" interrupt and the corresponding link layer state flags.
> >
> > And because in the whole call trace I see only generic PCIe and USB
> > code paths without any driver-specific parts, I suspect that this is
> > not a PCIe controller-specific issue but rather something "wrong" in
> > generic PCIe (or USB) code. That is why I sent this email, so maybe
> > somebody else finds something suspicious here.
> >
> > But there is still a chance that the issue is in the pci-aardvark.c
> > driver, and that the driver somehow masked it and propagated it into
> > the generic PCIe hot plug code path.
>
> If you hot-remove the XHCI controller, accesses to its MMIO space
> will fail. xhci_irq() seems to perform such MMIO accesses.

That abort happens at offset 4d00. Here is the relevant part of the
objdump output:

if (!arch_irqs_disabled_flags(flags))
4ccc: 340014a0 cbz w0, 4f60 <xhci_irq+0x2d0>
4cd0: d2800000 mov x0, #0x0 // #0
4cd4: 910a7276 add x22, x19, #0x29c
4cd8: 52800022 mov w2, #0x1 // #1
4cdc: f98002d1 prfm pstl1strm, [x22]
4ce0: 885ffec1 ldaxr w1, [x22]
4ce4: 4a000023 eor w3, w1, w0
4ce8: 35000063 cbnz w3, 4cf4 <xhci_irq+0x64>
4cec: 88037ec2 stxr w3, w2, [x22]
4cf0: 35ffff83 cbnz w3, 4ce0 <xhci_irq+0x50>
4cf4: 35002741 cbnz w1, 51dc <xhci_irq+0x54c>
status = readl(&xhci->op_regs->status);
4cf8: f9400f41 ldr x1, [x26, #24]
4cfc: 91001021 add x1, x1, #0x4
4d00: b9400021 ldr w1, [x1]

So it looks like it is that MMIO access, right?
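
To spell out how the last three instructions map onto the C statement:
the first ldr fetches the op_regs pointer, the add applies the offset of
the status register, and the ldr at 4d00 is the 32-bit MMIO read itself.
Below is a minimal user-space sketch of that offset arithmetic. The
struct is a trimmed, hypothetical stand-in for the start of the xHCI
operational register block (command at 0x0, status at 0x4), not the
kernel's own definition.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, trimmed stand-in for the xHCI operational registers. */
struct op_regs_sketch {
	uint32_t command;	/* offset 0x0 */
	uint32_t status;	/* offset 0x4, the register read at 4d00 */
};

/* Compile-time check that the layout matches "add x1, x1, #0x4". */
static_assert(offsetof(struct op_regs_sketch, status) == 0x4,
	      "status register is expected at offset 0x4");

/* Offset applied to the op_regs pointer before the faulting load. */
static size_t status_offset(void)
{
	return offsetof(struct op_regs_sketch, status);
}
```

So the faulting address is the op_regs base plus status_offset(), which
is consistent with the readl() annotation in the listing.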

> Normally this should happen silently and MMIO accesses just return
> with a fabricated "all ones" response. Chances are however that the
> Aardvark controller raises a synchronous external abort instead.

This makes sense. Good catch also with the fact that it is from threaded
context!
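
For comparison, on controllers that do fabricate the "all ones"
response, a driver can recognize a vanished device from the value it
reads back. A minimal sketch of that check follows; the helper name is
hypothetical and this is not a quote of any existing driver code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* The fabricated response for a 32-bit read has every bit set. */
static_assert(UINT32_MAX == 0xffffffffu, "all-ones pattern is 32 bits wide");

/*
 * Hypothetical helper: treat an all-ones status value as "device gone"
 * instead of interpreting it as a set of real status bits.
 */
static bool status_means_device_gone(uint32_t status)
{
	return status == UINT32_MAX;
}
```

Such a check only helps when the read completes at all. If the
controller raises a synchronous external abort, the CPU never gets as
far as the comparison.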

> Perhaps you can teach it not to do that.

No :-( I read all the documentation which is available for this PCIe
controller, part of the Marvell A3720 SoC, and I have not found anything
which allows me to configure the raising of external aborts.

I already figured out that the CPU also receives an external abort when
trying to issue a new PIO transfer for accessing PCI config space while
the previous transfer has not finished yet. And there is also no way (at
least in the documentation) to "mask" this external abort. But this
issue can be fixed in the pci-aardvark.c driver by disallowing access to
config space while the previous transfer is still running (I will send
a patch for this one).
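
The idea of refusing a new config access while a transfer is in flight
can be sketched roughly as below. This is only a user-space illustration
with hypothetical names (pio_state_sketch, pio_try_start, pio_finish).
It is not the patch itself and it ignores locking and timeouts.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical state: true while a PIO config transfer is in flight. */
struct pio_state_sketch {
	bool busy;
};

/* Try to start a transfer; refuse instead of touching the HW while busy. */
static bool pio_try_start(struct pio_state_sketch *s)
{
	if (s->busy)
		return false;	/* caller must retry or report an error */
	s->busy = true;
	return true;
}

/* Mark the in-flight transfer as completed. */
static void pio_finish(struct pio_state_sketch *s)
{
	assert(s->busy);
	s->busy = false;
}
```

A caller would only program the PIO registers after pio_try_start()
returned true, and would call pio_finish() once the transfer completed
or timed out.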

So it seems that the PCIe controller HW triggers these external aborts
when a device on the PCIe bus is not accessible anymore.

If this issue is really caused by an MMIO access from the xhci driver
when the device is not accessible on the bus anymore, can we do
something to prevent this kernel crash? Somehow mask that external abort
in the kernel for the duration of the MMIO access?

> Thanks,
>
> Lukas