Re: [Nouveau] [RFC, drm-misc-next v4 0/9] PCI/VGA: Allowing the user to select the primary video adapter at boot time

From: Christian König
Date: Thu Sep 07 2023 - 11:38:55 EST


On 07.09.23 at 17:26, suijingfeng wrote:
[SNIP]

Then let me give you another example; see below for a detailed description.
I have an AMD BC160 GPU; see [1] for what it looks like.

The GPU doesn't export a display connector interface.
It can effectively be seen as a render-only GPU, or a compute-class GPU for bitcoin mining.
But its firmware still claims this GPU is VGA compatible.
When this GPU is mounted on the motherboard, the system always selects it as primary,
yet this GPU can't be connected to a monitor.

In such a situation, modprobe.blacklist=amdgpu doesn't work either,
because vgaarb always selects this GPU as primary; this is a device-level decision.

It's not VGAARB that makes this selection; it's the BIOS. VGAARB just detects what the BIOS has decided.


$ dmesg | grep vgaarb:

[    3.541405] pci 0000:0c:00.0: vgaarb: BAR 0: [mem 0xa0000000-0xafffffff 64bit pref] contains firmware FB [0xa0000000-0xa02fffff]
[    3.901448] pci 0000:05:00.0: vgaarb: setting as boot VGA device
[    3.905375] pci 0000:05:00.0: vgaarb: VGA device added: decodes=io+mem,owns=none,locks=none
[    3.905382] pci 0000:0c:00.0: vgaarb: setting as boot VGA device (overriding previous)
[    3.909375] pci 0000:0c:00.0: vgaarb: VGA device added: decodes=io+mem,owns=io+mem,locks=none
[    3.913375] pci 0000:0d:00.0: vgaarb: VGA device added: decodes=io+mem,owns=none,locks=none
[    3.913377] vgaarb: loaded
[   13.513760] amdgpu 0000:0c:00.0: vgaarb: deactivate vga console
[   19.020992] amdgpu 0000:0c:00.0: vgaarb: changed VGA decodes: olddecodes=io+mem,decodes=none:owns=io+mem
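The log above shows vgaarb first marking 0000:05:00.0 as the boot VGA device and then switching to 0000:0c:00.0 ("overriding previous"), which is also the device whose BAR 0 contains the firmware framebuffer (first line). A minimal sketch that extracts the final choice from such a log; the function name final_boot_vga is hypothetical, and the regex only assumes the "pci <domain:bus:dev.fn>: vgaarb: setting as boot VGA device" message format seen here.

```python
import re
from typing import Optional

# Matches e.g. "pci 0000:0c:00.0: vgaarb: setting as boot VGA device (overriding previous)".
# Assumption: message format as printed in the dmesg excerpt above.
_BOOT_VGA_RE = re.compile(
    r"\b([0-9a-f]{4,}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]): vgaarb: setting as boot VGA device"
)


def final_boot_vga(dmesg_text: str) -> Optional[str]:
    """Return the PCI address vgaarb last marked as boot VGA device, or None.

    A later "(overriding previous)" line replaces the earlier choice,
    so the last match in the log wins.
    """
    last = None
    for line in dmesg_text.splitlines():
        m = _BOOT_VGA_RE.search(line)
        if m:
            last = m.group(1)
    return last
```

Feed it the text of `dmesg` (reading the kernel log may require root, depending on kernel.dmesg_restrict). Note this only reports what vgaarb recorded; per the reply above, the underlying choice was made by the BIOS.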

I'm using an Ubuntu 22.04 system with ast.modeset=10 passed on the command line,
and I'm still able to enter the graphical environment, which treats this GPU as a render-only GPU.
I will probably continue to examine what's wrong; apart from this, drm/amdgpu reports
" *ERROR* IB test failed on sdma0 (-110)" to me.

Does this count as a problem?

No, again that is perfectly expected behavior.

Some BIOSes (or maybe most, by modern standards) allow you to override this, but if you later override it from the OS, you run the hardware outside of what's been validated.

When you put a VGA device into a board with an integrated VGA device, the integrated one gets disabled. This is even part of some PCIe specification, IIRC.

So the problems you run into here are perfectly expected.

Regards,
Christian.


Until I find a solution, I have to keep this de facto render-only GPU mounted,
because I need to recompile the kernel module, install it, and test it.

All I need is a 2D video card to display something; ast drm is OK, despite being simple.
It suits my daily usage with VIM, and that's enough for me.

Now, the real questions I want to ask are:

1)

When the kernel driver module is blocked (by modprobe.blacklist=amdgpu),
vgaarb still selects the GPU as primary, which leaves the X server crashed (because no kernel-space driver is loaded).
Does this count as a problem?
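The situation in question 1 is the combination "this device is the boot VGA device" plus "no kernel driver is bound to it". A minimal sketch for checking that combination from userspace, assuming the usual PCI sysfs layout: each device directory under /sys/bus/pci/devices may carry a boot_vga attribute (reading "1" for the boot VGA device) and, when a driver is bound, a driver symlink. The function name boot_vga_driver is hypothetical.

```python
import os
from typing import Optional, Tuple

SYSFS_PCI = "/sys/bus/pci/devices"


def boot_vga_driver(root: str = SYSFS_PCI) -> Optional[Tuple[str, Optional[str]]]:
    """Find the PCI device whose boot_vga attribute reads "1".

    Returns (pci_address, driver_name); driver_name is None when no kernel
    driver is bound (e.g. because of modprobe.blacklist=amdgpu).
    Returns None if no device reports boot_vga == 1.
    """
    for addr in sorted(os.listdir(root)):
        dev = os.path.join(root, addr)
        try:
            with open(os.path.join(dev, "boot_vga")) as f:
                if f.read().strip() != "1":
                    continue
        except OSError:
            # boot_vga is typically only exposed for VGA-class devices.
            continue
        drv = os.path.join(dev, "driver")
        # The driver symlink points at /sys/bus/pci/drivers/<name> when bound.
        name = os.path.basename(os.path.realpath(drv)) if os.path.islink(drv) else None
        return addr, name
    return None
```

On the machine described here, booting with modprobe.blacklist=amdgpu would presumably yield ("0000:0c:00.0", None), i.e. a boot VGA device with nothing bound to drive it.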


2)

Does my approach of mounting another GPU as the primary display adapter,
while the real purpose is fixing bugs in and developing for the other GPU,
count as a use case?


$ cat demsg.txt | grep drm

[   10.099888] ACPI: bus type drm_connector registered
[   11.083920] etnaviv 0000:0d:00.0: [drm] bind etnaviv-display, master name: 0000:0d:00.0
[   11.084106] [drm] Initialized etnaviv 1.3.0 20151214 for 0000:0d:00.0 on minor 0
[   13.301702] [drm] amdgpu kernel modesetting enabled.
[   13.359820] [drm] initializing kernel modesetting (NAVI12 0x1002:0x7360 0x1002:0x0A34 0xC7).
[   13.368246] [drm] register mmio base: 0xEB100000
[   13.372861] [drm] register mmio size: 524288
[   13.380788] [drm] add ip block number 0 <nv_common>
[   13.385661] [drm] add ip block number 1 <gmc_v10_0>
[   13.390531] [drm] add ip block number 2 <navi10_ih>
[   13.395405] [drm] add ip block number 3 <psp>
[   13.399760] [drm] add ip block number 4 <smu>
[   13.404111] [drm] add ip block number 5 <dm>
[   13.408378] [drm] add ip block number 6 <gfx_v10_0>
[   13.413249] [drm] add ip block number 7 <sdma_v5_0>
[   13.433546] [drm] add ip block number 8 <vcn_v2_0>
[   13.433547] [drm] add ip block number 9 <jpeg_v2_0>
[   13.497757] [drm] VCN decode is enabled in VM mode
[   13.502540] [drm] VCN encode is enabled in VM mode
[   13.508785] [drm] JPEG decode is enabled in VM mode
[   13.529596] [drm] vm size is 262144 GB, 4 levels, block size is 9-bit, fragment size is 9-bit
[   13.564762] [drm] Detected VRAM RAM=8176M, BAR=256M
[   13.569628] [drm] RAM width 2048bits HBM
[   13.574167] [drm] amdgpu: 8176M of VRAM memory ready
[   13.579125] [drm] amdgpu: 15998M of GTT memory ready.
[   13.584184] [drm] GART: num cpu pages 131072, num gpu pages 131072
[   13.590505] [drm] PCIE GART of 512M enabled (table at 0x0000008000300000).
[   13.598749] [drm] Found VCN firmware Version ENC: 1.16 DEC: 5 VEP: 0 Revision: 4
[   13.671786] [drm] reserve 0xe00000 from 0x81fd000000 for PSP TMR
[   13.801235] [drm] Display Core v3.2.247 initialized on DCN 2.0
[   13.807061] [drm] DP-HDMI FRL PCON supported
[   13.832382] [drm] kiq ring mec 2 pipe 1 q 0
[   13.838131] [drm] VCN decode and encode initialized successfully(under DPG Mode).
[   13.845877] [drm] JPEG decode initialized successfully.
[   14.072508] [drm] Initialized amdgpu 3.54.0 20150101 for 0000:0c:00.0 on minor 1
[   14.080976] amdgpu 0000:0c:00.0: [drm] Cannot find any crtc or sizes
[   14.087341] [drm] DSC precompute is not needed.
[   16.487330] systemd[1]: Starting Load Kernel Module drm...
[  619.901873] [drm] PCIE GART of 512M enabled (table at 0x0000008000300000).
[  619.901898] [drm] PSP is resuming...
[  619.925307] [drm] reserve 0xe00000 from 0x81fd000000 for PSP TMR
[  619.991034] [drm] psp gfx command AUTOLOAD_RLC(0x21) failed and response status is (0xFFFF000D)
[  620.294366] [drm] kiq ring mec 2 pipe 1 q 0
[  620.298953] [drm] VCN decode and encode initialized successfully(under DPG Mode).
[  620.299103] [drm] JPEG decode initialized successfully.
[  621.309543] [drm:sdma_v5_0_ring_test_ib [amdgpu]] *ERROR* amdgpu: IB test timed out
[  621.317577] amdgpu 0000:0c:00.0: [drm:amdgpu_ib_ring_tests [amdgpu]] *ERROR* IB test failed on sdma0 (-110).
[  622.333548] [drm:sdma_v5_0_ring_test_ib [amdgpu]] *ERROR* amdgpu: IB test timed out
[  622.341587] amdgpu 0000:0c:00.0: [drm:amdgpu_ib_ring_tests [amdgpu]] *ERROR* IB test failed on sdma1 (-110).
[  622.354071] [drm:amdgpu_device_delayed_init_work_handler [amdgpu]] *ERROR* ib ring test failed (-110).
[  622.363721] amdgpu 0000:0c:00.0: [drm] Cannot find any crtc or sizes
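In the second group of messages (starting at about 619.9 s with "PSP is resuming..."), both SDMA rings fail their IB tests with -110, which is -ETIMEDOUT in the asm-generic Linux errno values and matches the "IB test timed out" lines. A minimal sketch that collects those failures from a log; the function names are hypothetical and the regex only assumes the "IB test failed on <ring> (<err>)" wording seen above.

```python
import re
from typing import List, Tuple

# Matches e.g. "*ERROR* IB test failed on sdma0 (-110)."
_IB_FAIL_RE = re.compile(r"IB test failed on (\w+) \((-?\d+)\)")

# ETIMEDOUT as defined in include/uapi/asm-generic/errno.h (used e.g. on x86).
LINUX_ETIMEDOUT = 110


def failed_ib_rings(dmesg_text: str) -> List[Tuple[str, int]]:
    """Return (ring_name, error_code) for every failed IB ring test line."""
    return [(m.group(1), int(m.group(2))) for m in _IB_FAIL_RE.finditer(dmesg_text)]


def timed_out_rings(dmesg_text: str) -> List[str]:
    """Return the names of rings whose IB test failed with -ETIMEDOUT."""
    return [ring for ring, err in failed_ib_rings(dmesg_text) if err == -LINUX_ETIMEDOUT]
```

The summary line from amdgpu_device_delayed_init_work_handler ("ib ring test failed (-110)") is deliberately not matched, since it names no ring.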

[1] https://www.techpowerup.com/gpu-specs/xfx-bc-160.b9346