Re: [PATCH v7 00/45] Recover from failure to probe GPU

From: Alex Deucher
Date: Thu Jan 05 2023 - 12:36:16 EST


On Thu, Jan 5, 2023 at 12:02 PM Mario Limonciello
<mario.limonciello@xxxxxxx> wrote:
>
> One of the first things that KMS drivers do during initialization is
> destroy the system firmware framebuffer by means of
> `drm_aperture_remove_conflicting_pci_framebuffers`.
>
> This means that if for any reason the GPU fails to probe, the user
> will be stuck with, at best, a screen frozen at the last thing that
> was shown before the KMS driver continued its probe.
>
> The problem is most pronounced when new GPU support is introduced
> because users will need to have a recent linux-firmware snapshot
> on their system when they boot a kernel with matching support.
>
> However, the problem is further exacerbated in the case of amdgpu because
> it has migrated to "IP discovery", where amdgpu will attempt to load
> on "ALL" AMD GPUs even if the driver is missing support for IP blocks
> contained in that GPU.
>
> IP discovery requires some probing and isn't run until after the
> framebuffer has been destroyed.
>
> This means a situation can occur where a user purchases a new GPU not
> yet supported by a distribution, and when booting the installer it will
> "freeze" even though the distribution's kernel lacks support for those
> IP blocks.
>
> The perfect example of this is Ubuntu 22.10 and the new dGPUs just
> launched by AMD. The installation media ships with kernel 5.19 (which
> has IP discovery) but the amdgpu support for those IP blocks landed in
> kernel 6.0. The matching linux-firmware was released after 22.10's launch.
> The screen will freeze without nomodeset. Even if a user manages to install
> and then upgrades to kernel 6.0 after install they'll still have the
> problem of missing firmware, and the same experience.
>
> This is quite jarring for users, particularly if they don't know
> that they have to use "nomodeset" to install.
>
> To help the situation, make the following changes to GPU discovery:
> 1) Delay releasing the firmware framebuffer until after early_init has
> completed. This will help the situation of an older kernel that doesn't
> yet support the IP blocks probing a new GPU. IP discovery will have failed.
> 2) Request loading all PSP, VCN, SDMA, SMU, DMCUB, MES and GC microcode
> into memory during early_init. This will help the situation of a kernel
> new enough for the IP discovery phase to pass but missing microcode
> from linux-firmware.git.
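
For readers following along, the ordering described in 1) and 2) can be
sketched as a small userspace mock. This is not the amdgpu code: the
struct, the mock_* names and the choice of -ENODEV as the failure code
are all assumptions made purely for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for a GPU device; not the real struct amdgpu_device. */
struct mock_gpu {
	bool ip_blocks_known;   /* would IP discovery succeed on this kernel? */
	bool microcode_present; /* is matching microcode installed? */
	bool firmware_fb_alive; /* is the firmware framebuffer still usable? */
};

/* Mock early_init: IP discovery first, then the microcode requests. */
static int mock_early_init(const struct mock_gpu *gpu)
{
	assert(gpu);
	if (!gpu->ip_blocks_known)
		return -ENODEV; /* case 1: kernel too old for this GPU */
	if (!gpu->microcode_present)
		return -ENODEV; /* case 2: kernel is fine, firmware is missing */
	return 0;
}

/* Mock probe: the framebuffer is released only once early_init succeeded. */
static int mock_probe(struct mock_gpu *gpu)
{
	int ret = mock_early_init(gpu);

	if (ret)
		return ret; /* firmware framebuffer is left untouched */

	gpu->firmware_fb_alive = false; /* the KMS driver takes over */
	return 0;
}
```

In both failure cases the mock probe returns before the framebuffer is
touched, so the user keeps whatever the firmware framebuffer was
showing instead of a frozen screen.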

Series is:
Reviewed-by: Alex Deucher <alexander.deucher@xxxxxxx>

>
> v6->v7:
> * Pick up tags
> * Fix PSP TAv1 handling to match previous behavior (securedisplay_context
> is only set on PSPv10 and PSPv12/Renoir)
> v5->v6:
> * Fix arguments for amdgpu_ucode_release to allow clearing pointer
> * Fix whitespace mistake in VCN
> * Pick up tags
> v4->v5:
> * Rename amdgpu_ucode_load to amdgpu_ucode_request
> * Add and utilize amdgpu_ucode_release throughout existing patches
> * Update all amdgpu code to stop using request_firmware and
> release_firmware for microcode
> * Drop export of amdgpu_ucode_validate outside of amdgpu_ucode.c
> * Pick up relevant tags for some patches
> v3->v4:
> * Rework to delay framebuffer release until early_init is done
> * Make IP load microcode during early init phase
> * Add SMU and DMCUB checks for early_init loading
> * Add some new helper code for wrapping request_firmware calls (needed for
> early_init to return something besides -ENOENT)
> v2->v3:
> * Pick up tags for patches 1-10
> * Rework patch 11 to not validate during discovery
> * Fix bugs with GFX9 due to gfx.num_gfx_rings not being set during
> discovery
> * Fix naming scheme for SDMA on dGPUs
> v1->v2:
> * Take the suggestion from v1 thread to delay the framebuffer release
> until IP discovery is done. This patch is CC'd to stable so that older
> stable kernels with IP discovery won't try to probe unknown IP.
> * Drop changes to drm aperture.
> * Fetch SDMA, VCN, MES, GC and PSP microcode during IP discovery.
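
The changelog above mentions two helper behaviours: a request wrapper
so early_init can return something besides -ENOENT (v3->v4), and a
release helper whose arguments allow clearing the pointer (v5->v6). A
rough userspace sketch of that shape follows. The mock_* names, the
fake loader and the use of -ENODEV are assumptions for illustration,
not the actual amdgpu_ucode_request/amdgpu_ucode_release code.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for struct firmware. */
struct mock_firmware {
	size_t size;
};

/* Fake loader: pretends that only "present.bin" exists on disk. */
static int mock_request_firmware(struct mock_firmware **fw, const char *name)
{
	if (strcmp(name, "present.bin") != 0)
		return -ENOENT;

	*fw = malloc(sizeof(**fw));
	if (!*fw)
		return -ENOMEM;

	(*fw)->size = 4096;
	return 0;
}

/* Wrapper: any load failure is reported as -ENODEV (assumed error code). */
static int mock_ucode_request(struct mock_firmware **fw, const char *name)
{
	assert(fw && name);
	*fw = NULL;
	if (mock_request_firmware(fw, name))
		return -ENODEV;
	return 0;
}

/* Takes a pointer-to-pointer so the caller's pointer can be cleared. */
static void mock_ucode_release(struct mock_firmware **fw)
{
	assert(fw);
	free(*fw); /* free(NULL) is a no-op, so a repeated release is safe */
	*fw = NULL;
}
```

Clearing the caller's pointer on release means an error path can call
the release helper unconditionally without risking a double free.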
>
> Mario Limonciello (45):
> drm/amd: Delay removal of the firmware framebuffer
> drm/amd: Add a legacy mapping to "amdgpu_ucode_ip_version_decode"
> drm/amd: Convert SMUv11 microcode to use
> `amdgpu_ucode_ip_version_decode`
> drm/amd: Convert SMUv13 microcode to use
> `amdgpu_ucode_ip_version_decode`
> drm/amd: Add a new helper for loading/validating microcode
> drm/amd: Use `amdgpu_ucode_request` helper for SDMA
> drm/amd: Convert SDMA to use `amdgpu_ucode_ip_version_decode`
> drm/amd: Make SDMA firmware load failures less noisy.
> drm/amd: Use `amdgpu_ucode_*` helpers for VCN
> drm/amd: Load VCN microcode during early_init
> drm/amd: Load MES microcode during early_init
> drm/amd: Use `amdgpu_ucode_*` helpers for MES
> drm/amd: Remove superfluous assignment for `adev->mes.adev`
> drm/amd: Use `amdgpu_ucode_*` helpers for GFX9
> drm/amd: Load GFX9 microcode during early_init
> drm/amd: Use `amdgpu_ucode_*` helpers for GFX10
> drm/amd: Load GFX10 microcode during early_init
> drm/amd: Use `amdgpu_ucode_*` helpers for GFX11
> drm/amd: Load GFX11 microcode during early_init
> drm/amd: Parse both v1 and v2 TA microcode headers using same function
> drm/amd: Avoid BUG() for case of SRIOV missing IP version
> drm/amd: Load PSP microcode during early_init
> drm/amd: Use `amdgpu_ucode_*` helpers for PSP
> drm/amd/display: Load DMUB microcode during early_init
> drm/amd: Use `amdgpu_ucode_release` helper for DMUB
> drm/amd: Use `amdgpu_ucode_*` helpers for SMU
> drm/amd: Load SMU microcode during early_init
> drm/amd: Optimize SRIOV switch/case for PSP microcode load
> drm/amd: Use `amdgpu_ucode_*` helpers for GFX6
> drm/amd: Use `amdgpu_ucode_*` helpers for GFX7
> drm/amd: Use `amdgpu_ucode_*` helpers for GFX8
> drm/amd: Use `amdgpu_ucode_*` helpers for GMC6
> drm/amd: Use `amdgpu_ucode_*` helpers for GMC7
> drm/amd: Use `amdgpu_ucode_*` helpers for GMC8
> drm/amd: Use `amdgpu_ucode_*` helpers for SDMA2.4
> drm/amd: Use `amdgpu_ucode_*` helpers for SDMA3.0
> drm/amd: Use `amdgpu_ucode_*` helpers for SDMA on CIK
> drm/amd: Use `amdgpu_ucode_*` helpers for UVD
> drm/amd: Use `amdgpu_ucode_*` helpers for VCE
> drm/amd: Use `amdgpu_ucode_*` helpers for CGS
> drm/amd: Use `amdgpu_ucode_*` helpers for GPU info bin
> drm/amd: Use `amdgpu_ucode_*` helpers for DMCU
> drm/amd: Use `amdgpu_ucode_release` helper for powerplay
> drm/amd: Use `amdgpu_ucode_release` helper for si
> drm/amd: make amdgpu_ucode_validate static
>
> drivers/gpu/drm/amd/amdgpu/amdgpu_cgs.c | 11 +-
> drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 22 +-
> drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c | 6 -
> drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c | 59 ++++
> drivers/gpu/drm/amd/amdgpu/amdgpu_mes.h | 1 +
> drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c | 299 +++++++++---------
> drivers/gpu/drm/amd/amdgpu/amdgpu_sdma.c | 25 +-
> drivers/gpu/drm/amd/amdgpu/amdgpu_sdma.h | 4 +-
> drivers/gpu/drm/amd/amdgpu/amdgpu_ucode.c | 259 ++++++++++++++-
> drivers/gpu/drm/amd/amdgpu/amdgpu_ucode.h | 4 +-
> drivers/gpu/drm/amd/amdgpu/amdgpu_uvd.c | 14 +-
> drivers/gpu/drm/amd/amdgpu/amdgpu_vce.c | 14 +-
> drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.c | 65 +---
> drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.h | 1 +
> drivers/gpu/drm/amd/amdgpu/cik_sdma.c | 16 +-
> drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c | 155 +++------
> drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c | 124 +++-----
> drivers/gpu/drm/amd/amdgpu/gfx_v6_0.c | 30 +-
> drivers/gpu/drm/amd/amdgpu/gfx_v7_0.c | 68 +---
> drivers/gpu/drm/amd/amdgpu/gfx_v8_0.c | 94 ++----
> drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c | 140 ++------
> drivers/gpu/drm/amd/amdgpu/gmc_v6_0.c | 14 +-
> drivers/gpu/drm/amd/amdgpu/gmc_v7_0.c | 13 +-
> drivers/gpu/drm/amd/amdgpu/gmc_v8_0.c | 13 +-
> drivers/gpu/drm/amd/amdgpu/imu_v11_0.c | 7 +-
> drivers/gpu/drm/amd/amdgpu/mes_v10_1.c | 108 ++-----
> drivers/gpu/drm/amd/amdgpu/mes_v11_0.c | 99 ++----
> drivers/gpu/drm/amd/amdgpu/psp_v10_0.c | 80 +----
> drivers/gpu/drm/amd/amdgpu/psp_v11_0.c | 131 +-------
> drivers/gpu/drm/amd/amdgpu/psp_v12_0.c | 79 +----
> drivers/gpu/drm/amd/amdgpu/psp_v13_0.c | 27 +-
> drivers/gpu/drm/amd/amdgpu/psp_v13_0_4.c | 14 +-
> drivers/gpu/drm/amd/amdgpu/psp_v3_1.c | 16 +-
> drivers/gpu/drm/amd/amdgpu/sdma_v2_4.c | 18 +-
> drivers/gpu/drm/amd/amdgpu/sdma_v3_0.c | 18 +-
> drivers/gpu/drm/amd/amdgpu/sdma_v4_0.c | 47 +--
> drivers/gpu/drm/amd/amdgpu/sdma_v5_0.c | 30 +-
> drivers/gpu/drm/amd/amdgpu/sdma_v5_2.c | 55 +---
> drivers/gpu/drm/amd/amdgpu/sdma_v6_0.c | 25 +-
> drivers/gpu/drm/amd/amdgpu/vcn_v1_0.c | 5 +-
> drivers/gpu/drm/amd/amdgpu/vcn_v2_0.c | 5 +-
> drivers/gpu/drm/amd/amdgpu/vcn_v2_5.c | 5 +-
> drivers/gpu/drm/amd/amdgpu/vcn_v3_0.c | 5 +-
> drivers/gpu/drm/amd/amdgpu/vcn_v4_0.c | 5 +-
> .../gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c | 110 ++++---
> drivers/gpu/drm/amd/pm/legacy-dpm/si_dpm.c | 11 +-
> .../gpu/drm/amd/pm/powerplay/amd_powerplay.c | 3 +-
> drivers/gpu/drm/amd/pm/swsmu/amdgpu_smu.c | 12 +-
> .../gpu/drm/amd/pm/swsmu/smu11/smu_v11_0.c | 51 +--
> .../gpu/drm/amd/pm/swsmu/smu13/smu_v13_0.c | 28 +-
> 50 files changed, 900 insertions(+), 1545 deletions(-)
>
> --
> 2.34.1
>