Keyword Review - Re: amdgpu didn't start with pci=nocrs parameter, get error "Fatal error during GPU init"

From: Christian König
Date: Fri Feb 24 2023 - 10:19:28 EST


Hi Mikhail,

this is pretty clearly a problem with the system and/or it's BIOS and not the GPU hw or the driver.

The option pci=nocrs makes the kernel ignore additional resource windows the BIOS reports through ACPI. This then most likely leads to problems with amdgpu because it can't bring up its PCIe resources any more.

The output of "sudo lspci -vvvv -s $BUSID_OF_AMDGPU" might help understand the problem, but I strongly suggest to try a BIOS update first.

Regards,
Christian.

Am 24.02.23 um 00:40 schrieb Mikhail Gavrilov:
Hi,
I have a laptop ASUS ROG Strix G15 Advantage Edition G513QY-HQ007. But
it is impossible to use without AC power because the system losts nvme
when I disconnect the power adapter.

Messages from kernel log when it happens:
nvme nvme0: controller is down; will reset: CSTS=0xffffffff, PCI_STATUS=0x10
nvme nvme0: Does your device have a faulty power saving mode enabled?
nvme nvme0: Try "nvme_core.default_ps_max_latency_us=0 pcie_aspm=off"
and report a bug

I tried to use recommended parameters
(nvme_core.default_ps_max_latency_us=0 and pcie_aspm=off) to resolve
this issue, but without successed.

In the linux-nvme mail list the last advice was to try the "pci=nocrs"
parameter.

But with this parameter the amdgpu driver refuses to work and makes
the system unbootable. I can solve the problem with the booting system
by blacklisting the driver but it is not a good solution, because I
don't wanna lose the GPU.

Why amdgpu not work with "pci=nocrs" ?
And is it possible to solve this incompatibility?
It is very important because when I boot the system without amdgpu
driver with "pci=nocrs" nvme is not losts when I disconnect the power
adapter. So "pci=nocrs" really helps.

Below that I see in kernel log when adds "pci=nocrs" parameter:

amdgpu 0000:03:00.0: amdgpu: Fetched VBIOS from ATRM
amdgpu: ATOM BIOS: SWBRT77321.001
[drm] VCN(0) decode is enabled in VM mode
[drm] VCN(0) encode is enabled in VM mode
[drm] JPEG decode is enabled in VM mode
Console: switching to colour dummy device 80x25
amdgpu 0000:03:00.0: amdgpu: Trusted Memory Zone (TMZ) feature
disabled as experimental (default)
[drm] GPU posting now...
[drm] vm size is 262144 GB, 4 levels, block size is 9-bit, fragment
size is 9-bit
amdgpu 0000:03:00.0: amdgpu: VRAM: 12272M 0x0000008000000000 -
0x00000082FEFFFFFF (12272M used)
amdgpu 0000:03:00.0: amdgpu: GART: 512M 0x0000000000000000 - 0x000000001FFFFFFF
amdgpu 0000:03:00.0: amdgpu: AGP: 267894784M 0x0000008400000000 -
0x0000FFFFFFFFFFFF
[drm] Detected VRAM RAM=12272M, BAR=16384M
[drm] RAM width 192bits GDDR6
[drm] amdgpu: 12272M of VRAM memory ready
[drm] amdgpu: 31774M of GTT memory ready.
amdgpu 0000:03:00.0: amdgpu: (-14) failed to allocate kernel bo
[drm] Debug VRAM access will use slowpath MM access
amdgpu 0000:03:00.0: amdgpu: Failed to DMA MAP the dummy page
[drm:amdgpu_device_init [amdgpu]] *ERROR* sw_init of IP block
<gmc_v10_0> failed -12
amdgpu 0000:03:00.0: amdgpu: amdgpu_device_ip_init failed
amdgpu 0000:03:00.0: amdgpu: Fatal error during GPU init
amdgpu 0000:03:00.0: amdgpu: amdgpu: finishing device.

Of course a full system log is also attached.