Re: Boot failure on gru-scarlet-inx with 5.9-rc2

From: Marc Zyngier
Date: Mon Aug 31 2020 - 05:27:42 EST


On 2020-08-31 08:18, Samuel Dionne-Riel wrote:
> On Sun, 30 Aug 2020 10:41:42 +0100
> Marc Zyngier <maz@xxxxxxxxxx> wrote:
>
> Hi,
>
>>
>> Could you try replacing the problematic patch with [1], and let me
>> know whether this changes anything on your end? This patch probably
>> isn't the right approach, but it would certainly help point me
>> in the right direction.
>>
>> [1]
>> https://lore.kernel.org/lkml/20200815125112.462652-2-maz@xxxxxxxxxx/

> While following up on a bisect session to figure out why the Wi-Fi
> broke between 5.8 and 5.9-rc1, I found something that you may already
> have in mind.
>
> It seems that anything that makes of_bus_pci_match() return true will
> cause this to happen. This is why your initial fix also fails.
>
> I believe my understanding is correct, since applying the following on
> top of 5.9-rc1 produces the same result:
>
> --- a/arch/arm64/boot/dts/rockchip/rk3399.dtsi
> +++ b/arch/arm64/boot/dts/rockchip/rk3399.dtsi
> @@ -227,6 +227,7 @@ dmac_peri: dma-controller@ff6e0000 {
>  	};
>  
>  	pcie0: pcie@f8000000 {
> +		device_type = "pci";
>  		compatible = "rockchip,rk3399-pcie";
>  		reg = <0x0 0xf8000000 0x0 0x2000000>,
>  		      <0x0 0xfd000000 0x0 0x1000000>;
>
> I found this because the PCI-based ath10k Wi-Fi broke with commit
> 2f96593ecc37e98bf99525f0629128080533867f, which changes things around
> the PCI bus.
>
> Am I right in understanding that your fix(es) relate to the change set
> containing that commit?
>
> My intuition is that the commit causing the boot issue could be related
> to changes in the PCI or PCIe subsystems, and that your of_bus_pci_match()
> fix is a red herring that only surfaced an existing issue.
>
> This is backed by the fact that applying the dts patch above on top of
> 2f96593e results in working Wi-Fi. I would assume that some commit
> between that one and 5.9-rc1 causes the complete failure to boot, and
> that it is unrelated to the commit first identified on 5.9-rc2.

Ah, so actually anything that *enables pcie* kills your system.
Great investigative work!

>
> This is further backed by another bisection, with the dts change
> applied, which points to commit d84c572de1a360501d2e439ac632126f5facf59d
> as the change that actually causes the tablet to fail to boot, as long
> as the pcie0 node is properly identified as PCI.
>
> I am unsure whether I should Cc everyone involved in that change set,
> though its author is (coincidentally) already on the original list of
> recipients.

I've deliberately moved Rob from Cc to To... ;-)

> Any further thoughts, given this additional information?

What you could do is start looking at which of the pci_is_root_bus()
changes breaks PCIe on this system. The fact that it breaks on your
system and not on mine is a bit puzzling.

Thanks,

M.
--
Jazz is not dead. It just smells funny...