RE: [PATCH] drm/amdgpu: Register bad page handler for Aldebaran

From: Joshi, Mukul
Date: Thu May 13 2021 - 19:10:42 EST





> -----Original Message-----
> From: Borislav Petkov <bp@xxxxxxxxx>
> Sent: Thursday, May 13, 2021 5:53 AM
> To: Joshi, Mukul <Mukul.Joshi@xxxxxxx>
> Cc: amd-gfx@xxxxxxxxxxxxxxxxxxxxx; Kasiviswanathan, Harish
> <Harish.Kasiviswanathan@xxxxxxx>; x86-ml <x86@xxxxxxxxxx>;
> lkml <linux-kernel@xxxxxxxxxxxxxxx>
> Subject: Re: [PATCH] drm/amdgpu: Register bad page handler for Aldebaran
>
> On Thu, May 13, 2021 at 03:20:36AM +0000, Joshi, Mukul wrote:
> > Exporting smca_get_bank_type() works fine when CONFIG_X86_MCE_AMD is
> > defined.
> > I would need to put #ifdef CONFIG_X86_MCE_AMD in my code to compile
> > the amdgpu driver when CONFIG_X86_MCE_AMD is not defined.
> > I can avoid all that by using is_smca_umc_v2().
> > I think it would be cleaner with using is_smca_umc_v2().
>
> See how smca_get_long_name() is exported and export that function the same
> way.
>

That's probably not the best example to look at.
smca_get_long_name() is used in drivers/edac/mce_amd.c, and that file is not
compiled when CONFIG_X86_MCE_AMD is not defined.

The amdgpu driver, on the other hand, has no dependency on CONFIG_X86_MCE_AMD.

So here is one option we could try:
1. Export smca_get_bank_type().
2. Wrap all of my code in the GPU driver in #ifdef CONFIG_X86_MCE_AMD.
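
To make the two steps concrete, here is a rough user-space sketch of the
shape I have in mind. It is not the actual patch: mock_smca_get_bank_type(),
the mock_bank_type enum, the gpu_* helpers and the "bank 3 is UMC v2" rule
are all made up for illustration, and the real smca_get_bank_type()
signature may differ. The #define only stands in for the Kconfig symbol.

```c
#include <assert.h>   /* only needed for the checks in the usage note */
#include <stdbool.h>

/* Assumption: stands in for the Kconfig symbol. Remove this line to model
 * a kernel built without CONFIG_X86_MCE_AMD. */
#define CONFIG_X86_MCE_AMD 1

/* Hypothetical, simplified bank types, for illustration only. */
enum mock_bank_type { MOCK_BANK_OTHER, MOCK_BANK_UMC_V2 };

#ifdef CONFIG_X86_MCE_AMD

/* Mock of the exported helper. Bank 3 is arbitrarily treated as UMC v2. */
static enum mock_bank_type mock_smca_get_bank_type(unsigned int bank)
{
	return bank == 3 ? MOCK_BANK_UMC_V2 : MOCK_BANK_OTHER;
}

static bool bad_page_handler_registered;

/* GPU driver side: only built when the MCE code is available. */
static void gpu_register_bad_page_handler(void)
{
	bad_page_handler_registered = true;
}

static bool gpu_should_handle_bank(unsigned int bank)
{
	return bad_page_handler_registered &&
	       mock_smca_get_bank_type(bank) == MOCK_BANK_UMC_V2;
}

#else /* !CONFIG_X86_MCE_AMD */

/* Empty stubs so callers still compile without the MCE code. */
static void gpu_register_bad_page_handler(void) { }

static bool gpu_should_handle_bank(unsigned int bank)
{
	(void)bank;
	return false;
}

#endif /* CONFIG_X86_MCE_AMD */
```

With the symbol defined, gpu_should_handle_bank(3) is false until
gpu_register_bad_page_handler() has been called and true afterwards, while
gpu_should_handle_bank(0) stays false. Without the symbol, the stubs make
both calls no-ops, so the rest of the driver needs no further #ifdefs.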

Will that work for you?

Thanks,
Mukul

> To save you some energy: is_smca_umc_v2() is not going to happen.


>
> > You can think of GPU device as a EDAC device here. It is mainly
> > interested in handling uncorrectable errors.
>
> An EDAC "device", as you call it, is not interested in handling UEs. If anything, it
> counts them.
>
> > It is a deferred interrupt that generates an MCE.
>
> Is that the same deferred interrupt which calls amd_deferred_error_interrupt() ?
>
> > When an uncorrectable error is detected on the GPU UMC, all we are
> > doing is determining the physical address where the error occurred and
> > then "retiring" the page that address belongs to.
>
> What page is that? Normal DRAM page or a page in some special GPU memory?
>
> > By retiring, we mean we reserve the page so that it is not available
> > for allocations to any applications.
>
> We do that for normal DRAM memory pages by poisoning them. I hope you
> don't mean that.
>
> Looking at
>
> amdgpu_ras_add_bad_pages
> |-> amdgpu_vram_mgr_reserve_range
>
> that's some VRAM thing so I'm guessing special memory on the GPU.
>
> If so, what happens with all those "retired" pages when you reboot?
> They're getting used again and potentially trigger the same UEs and the same
> retiring happens?
>
> > We are providing information to the user by storing all the
> > information about the retired pages in EEPROM. This can be accessed
> > through sysfs.
>
> Ok, I'm a user and I can access that information through sysfs. What can I do
> with it?
>
> > Hope it clears what "bad page retirement" is achieving.
>
> It is getting there.
>
> Thx.
>
> --
> Regards/Gruss,
> Boris.
>
> https://people.kernel.org/tglx/notes-about-netiquette