Re: amdgpu didn't start with pci=nocrs parameter, get error "Fatal error during GPU init"

From: Christian König
Date: Fri Feb 24 2023 - 10:31:24 EST


On 24.02.23 at 13:29, Christian König wrote:
On 24.02.23 at 09:38, Mikhail Gavrilov wrote:
On Fri, Feb 24, 2023 at 12:13 PM Christian König
<ckoenig.leichtzumerken@xxxxxxxxx> wrote:
Hi Mikhail,

this is pretty clearly a problem with the system and/or its BIOS and
not the GPU hw or the driver.

The option pci=nocrs makes the kernel ignore additional resource windows
the BIOS reports through ACPI. This then most likely leads to problems
with amdgpu because it can't bring up its PCIe resources any more.
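To confirm whether a given boot actually used this option, the kernel command line (readable from /proc/cmdline on a running system) can be inspected. The following is only a minimal sketch: the helper names pci_options and nocrs_active are made up for illustration, and it assumes the usual comma-separated form of the pci= parameter.

```python
def pci_options(cmdline):
    """Return the options given to every pci= parameter on a kernel command line.

    Assumes the usual comma-separated form, e.g. "pci=nocrs,realloc".
    """
    options = []
    for token in cmdline.split():
        if token.startswith("pci="):
            # Drop the "pci=" prefix and split the remaining option list.
            options.extend(opt for opt in token[len("pci="):].split(",") if opt)
    return options


def nocrs_active(cmdline):
    """True if the (hypothetical) parser above finds the nocrs option."""
    return "nocrs" in pci_options(cmdline)
```

On a live system the input string would come from reading /proc/cmdline; the functions themselves only work on the text they are handed.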

The output of "sudo lspci -vvvv -s $BUSID_OF_AMDGPU" might help to
understand the problem.
I attach both lspci for pci=nocrs and without pci=nocrs.

The differences for Cezanne Radeon Vega Series:
with pci=nocrs:
Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr-
Stepping- SERR- FastB2B- DisINTx-
Interrupt: pin A routed to IRQ 255
Region 4: I/O ports at e000 [disabled] [size=256]
Capabilities: [c0] MSI-X: Enable- Count=4 Masked-

Without pci=nocrs:
Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr-
Stepping- SERR- FastB2B- DisINTx+
Interrupt: pin A routed to IRQ 44
Region 4: I/O ports at e000 [size=256]
Capabilities: [c0] MSI-X: Enable+ Count=4 Masked-
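Comparing the Control lines by eye is error-prone, so a small script can list which flags flipped between two lspci dumps. This is a rough sketch, not part of lspci or any existing tool: parse_flags and changed_flags are illustrative names, and it assumes each flag is printed as a name directly followed by "+" or "-", as in the output above.

```python
import re

# A flag is a name (letters, digits, "/") immediately followed by "+" or "-".
FLAG_RE = re.compile(r"([\w/]+)([+-])")


def parse_flags(control_text):
    """Map each flag name in an lspci "Control:" line to True (+) or False (-)."""
    # Ignore the "Control:" label itself if it is present.
    body = control_text.split(":", 1)[-1]
    return {name: sign == "+" for name, sign in FLAG_RE.findall(body)}


def changed_flags(first, second):
    """Return the sorted names of flags whose state differs between two lines."""
    a, b = parse_flags(first), parse_flags(second)
    return sorted(name for name in a.keys() & b.keys() if a[name] != b[name])
```

Fed the two Cezanne Control lines above (with the wrapped second line joined back on), it reports that only I/O and DisINTx differ.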


The differences for Navi 22 Radeon 6800M:
with pci=nocrs:
Control: I/O- Mem- BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr-
Stepping- SERR- FastB2B- DisINTx-
Interrupt: pin A routed to IRQ 255
Region 0: Memory at f800000000 (64-bit, prefetchable) [disabled] [size=16G]
Region 2: Memory at fc00000000 (64-bit, prefetchable) [disabled] [size=256M]
Region 5: Memory at fca00000 (32-bit, non-prefetchable) [disabled] [size=1M]

Well, that explains it. When the PCI subsystem has to disable the BARs of the GPU, we can't access it any more.
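For anyone who wants to check this on their own machine, the disabled BARs can be picked out of saved lspci output mechanically. The snippet below is just a sketch under the assumption that lspci marks such regions with the literal text "[disabled]", as it does in the dump above; disabled_regions is a made-up helper name.

```python
import re

# Matches the "Region N:" prefix that lspci -vv prints for each BAR.
REGION_RE = re.compile(r"Region (\d+):")


def disabled_regions(lspci_text):
    """Return the region numbers that lspci reports as [disabled]."""
    found = []
    for line in lspci_text.splitlines():
        match = REGION_RE.search(line)
        if match and "[disabled]" in line:
            found.append(int(match.group(1)))
    return found
```

Run against the Navi 22 dump taken with pci=nocrs, it should list regions 0, 2 and 5, and nothing for the dump taken without the option.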

The only thing we could do is to make sure that the driver at least fails gracefully.

Do you still have network access to the box when amdgpu fails to load, and could you grab whatever is in dmesg?

Sorry, I totally missed that you attached the full dmesg to your original mail.

Yeah, the driver did fail gracefully. But then X doesn't come up and then gdm just dies.

Sorry, there is really nothing we can do here. Maybe ping somebody with more ACPI background for help.

Regards,
Christian.


Thanks,
Christian.

AtomicOpsCtl: ReqEn-
Capabilities: [a0] MSI: Enable- Count=1/1 Maskable- 64bit+
Address: 0000000000000000  Data: 0000

Without pci=nocrs:
Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr-
Stepping- SERR- FastB2B- DisINTx+
Latency: 0, Cache Line Size: 64 bytes
Interrupt: pin A routed to IRQ 103
Region 0: Memory at f800000000 (64-bit, prefetchable) [size=16G]
Region 2: Memory at fc00000000 (64-bit, prefetchable) [size=256M]
Region 5: Memory at fca00000 (32-bit, non-prefetchable) [size=1M]
AtomicOpsCtl: ReqEn+
Capabilities: [a0] MSI: Enable+ Count=1/1 Maskable- 64bit+
Address: 00000000fee00000  Data: 0000

but I strongly suggest trying a BIOS update first.
This is the first thing that was done. And I am afraid there are no more BIOS updates.
https://rog.asus.com/laptops/rog-strix/2021-rog-strix-g15-advantage-edition-series/helpdesk_bios/

I also have experience in dealing with manufacturers' tech support.
Usually it ends with "we do not provide drivers for Linux".