RE: [PATCH] drm/amdgpu: Register bad page handler for Aldebaran

From: Joshi, Mukul
Date: Wed May 12 2021 - 16:12:26 EST


> -----Original Message-----
> From: Borislav Petkov <bp@xxxxxxxxx>
> Sent: Wednesday, May 12, 2021 5:37 AM
> To: Joshi, Mukul <Mukul.Joshi@xxxxxxx>
> Cc: amd-gfx@xxxxxxxxxxxxxxxxxxxxx; Kasiviswanathan, Harish
> <Harish.Kasiviswanathan@xxxxxxx>; x86-ml <x86@xxxxxxxxxx>; lkml <linux-
> kernel@xxxxxxxxxxxxxxx>
> Subject: Re: [PATCH] drm/amdgpu: Register bad page handler for Aldebaran
>
> Hi,
>
> so this is a drive-by review using the lore.kernel.org mail because I wasn't CCed
> on this.
>
> On Tue, May 11, 2021 at 09:30:58PM -0400, Mukul Joshi wrote:
> > +static int amdgpu_bad_page_notifier(struct notifier_block *nb,
> > +				    unsigned long val, void *data)
> > +{
> > +	struct mce *m = (struct mce *)data;
> > +	struct amdgpu_device *adev = NULL;
> > +	uint32_t gpu_id = 0;
> > +	uint32_t umc_inst = 0;
> > +	uint32_t chan_index = 0;
> > +	struct ras_err_data err_data = {0, 0, 0, NULL};
> > +	struct eeprom_table_record err_rec;
> > +	uint64_t retired_page;
> > +
> > +	/*
> > +	 * If the error was generated in UMC_V2, which belongs to GPU UMCs,
>
> Why does it matter if the error is a v2 UMC generated one?
>
> IOW, can you detect the type of errors in GPU memory - I assume this is what
> this is supposed to handle - from the error signature itself instead of doing
> is_smca_umc_v2?

SMCA UMCv2 is the MCA bank type that corresponds to the GPU's UMCs, and the GPU driver is only interested in errors on the GPU UMCs.
Without the is_smca_umc_v2() bank check we cannot tell GPU UMC errors apart from other UMC errors.
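
To make that filter concrete, here is a rough, userspace-only sketch of the check. XEC() here mirrors what I believe the kernel helper does (extract the extended error code from MCA_STATUS bits [21:16]); the 0x1f mask is the one the patch uses:

```c
#include <stdint.h>

/* Sketch only: XEC() as I understand the kernel helper -- the extended
 * error code sits in MCA_STATUS bits [21:16]. */
#define XEC(status, mask)	(((status) >> 16) & (mask))

/* Only DramECC errors (extended error code 0) reported by the GPU's
 * UMCv2 bank are of interest to the driver; everything else bails out. */
static int is_gpu_dram_ecc_error(uint64_t status, int bank_is_umc_v2)
{
	return bank_is_umc_v2 && XEC(status, 0x1f) == 0x0;
}
```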

>
> > +	 * and error occurred in DramECC (Extended error code = 0) then only
> > +	 * process the error, else bail out.
> > +	 */
> > +	if (!m || !(is_smca_umc_v2(m->bank) && (XEC(m->status, 0x1f) == 0x0)))
> > +		return NOTIFY_DONE;
> > +
> > +	gpu_id = GET_MCA_IPID_GPUID(m->ipid);
> > +
> > +	/*
> > +	 * GPU Id is offset by GPU_ID_OFFSET in MCA_IPID_UMC register.
> > +	 */
> > +	gpu_id -= GPU_ID_OFFSET;
> > +
> > +	adev = find_adev(gpu_id);
> > +	if (!adev) {
> > +		pr_warn("%s: Unable to find adev for gpu_id: %d\n",
> > +			__func__, gpu_id);
> > +		return NOTIFY_DONE;
> > +	}
> > +
> > +	/*
> > +	 * If it is correctable error, then print a message and return.
> > +	 */
> > +	if (mce_is_correctable(m)) {
> > +		dev_info(adev->dev, "%s: UMC Correctable error detected.",
> > +			 __func__);
>
> So put yourself in the shoes of a support engineer who starts getting all those
> calls about people's hardware getting correctable errors and should they replace
> it and should AMD RMA the GPUs and so on and so on..? Do you really wanna be
> on the receiving side of that call?
>
> IOW, whom does printing the fact that the GPU had a corrected error which got
> corrected by the hardware, help and do you really want to upset people
> needlessly?

Agree. I will remove this debug print.

>
> > +		return NOTIFY_OK;
> > +	}
> > +
> > +	/*
> > +	 * If it is uncorrectable error, then find out UMC instance and
> > +	 * channel index.
> > +	 */
> > +	find_umc_inst_chan_index(m, &umc_inst, &chan_index);
>
> That's a void function but it could return a pointer to the instance and channel
> bundled in a struct maybe...

Maybe. I hope it's not too much of a concern if it stays the way it is.
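
If it helps, the struct-return variant could look roughly like the sketch below. To be clear, the bit layout here is invented purely for illustration; the real decode stays whatever find_umc_inst_chan_index() computes from MCA_IPID today:

```c
#include <stdint.h>

/* Hypothetical bundling of the two out-parameters into one return value. */
struct umc_loc {
	uint32_t umc_inst;
	uint32_t chan_index;
};

static struct umc_loc find_umc_loc(uint64_t ipid)
{
	struct umc_loc loc = {
		/* Made-up field positions, for illustration only:
		 * instance in bits [35:32], channel in bits [39:36]. */
		.umc_inst   = (uint32_t)((ipid >> 32) & 0xf),
		.chan_index = (uint32_t)((ipid >> 36) & 0xf),
	};
	return loc;
}
```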

>
> > +
> > +	dev_info(adev->dev, "Uncorrectable error detected in UMC inst: %d, chan_idx: %d",
> > +		 umc_inst, chan_index);
> > +
> > +	memset(&err_rec, 0x0, sizeof(struct eeprom_table_record));
> > +
> > +	/*
> > +	 * Translate UMC channel address to Physical address
> > +	 */
> > +	retired_page = ADDR_OF_8KB_BLOCK(m->addr) |
> > +		       ADDR_OF_256B_BLOCK(chan_index) |
> > +		       OFFSET_IN_256B_BLOCK(m->addr);
> > +
> > +	err_rec.address = m->addr;
> > +	err_rec.retired_page = retired_page >> AMDGPU_GPU_PAGE_SHIFT;
> > +	err_rec.ts = (uint64_t)ktime_get_real_seconds();
> > +	err_rec.err_type = AMDGPU_RAS_EEPROM_ERR_NON_RECOVERABLE;
> > +	err_rec.cu = 0;
> > +	err_rec.mem_channel = chan_index;
> > +	err_rec.mcumc_id = umc_inst;
> > +
> > +	err_data.err_addr = &err_rec;
> > +	err_data.err_addr_cnt = 1;
> > +
> > +	if (amdgpu_bad_page_threshold != 0) {
> > +		amdgpu_ras_add_bad_pages(adev, err_data.err_addr,
> > +					 err_data.err_addr_cnt);
> > +		amdgpu_ras_save_bad_pages(adev);
> > +	}
> > +
> > +	return NOTIFY_OK;
> > +}
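
For what it's worth, the channel-address-to-physical translation above can be exercised on its own. The macro bodies below are my reading of amdgpu's UMC headers (32 channels interleaved at 256B granularity, hence the 8KB block), so treat them as an assumption rather than gospel:

```c
#include <stdint.h>

/* Assumed macro bodies, modeled on amdgpu's UMC address translation:
 * 32 channels interleaved at 256B granularity make up an 8KB block. */
#define ADDR_OF_8KB_BLOCK(addr)		(((addr) & ~0xffULL) << 5)
#define ADDR_OF_256B_BLOCK(chan)	((uint64_t)(chan) << 8)
#define OFFSET_IN_256B_BLOCK(addr)	((addr) & 0xffULL)

/* Compose the retired physical page address from the normalized UMC
 * address and the channel index, as the patch does. */
static uint64_t translate(uint64_t norm_addr, uint32_t chan_index)
{
	return ADDR_OF_8KB_BLOCK(norm_addr) |
	       ADDR_OF_256B_BLOCK(chan_index) |
	       OFFSET_IN_256B_BLOCK(norm_addr);
}
```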
> > +
> > +static struct notifier_block amdgpu_bad_page_nb = {
> > +	.notifier_call = amdgpu_bad_page_notifier,
> > +	.priority = MCE_PRIO_ACCEL,
>
> Nothing ever explains why this needs to be a separate priority. So how about it?

I wasn't sure whether I should use the EDAC priority here or create a new one for accelerator devices.
I thought using the EDAC priority might not be accepted by the maintainers, since EDAC and GPU (accelerator)
devices are two different classes of device.
That is why I created a new one.
I am OK with using the EDAC priority if that is acceptable.
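
On what the priority actually buys us: as far as I can tell, its only functional effect is where in the chain the callback runs (higher priority first). A toy userspace model of the sorted insert that notifier registration performs, just to illustrate the ordering:

```c
#include <stddef.h>

/* Toy model of a priority-ordered notifier chain: registration keeps
 * the list sorted so higher-priority callbacks are invoked first. */
struct toy_notifier {
	int priority;
	struct toy_notifier *next;
};

static void toy_chain_register(struct toy_notifier **head,
			       struct toy_notifier *nb)
{
	/* Walk past entries with higher-or-equal priority, then insert
	 * before the first lower-priority one. */
	while (*head && (*head)->priority >= nb->priority)
		head = &(*head)->next;
	nb->next = *head;
	*head = nb;
}
```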

>
> > +};
> > +
> > +static void amdgpu_register_bad_pages_mca_notifier(void)
> > +{
> > +	/*
> > +	 * Register the x86 notifier with MCE subsystem.
> > +	 * Please note a notifier can be registered only once
> > +	 * with the MCE subsystem.
> > +	 */
>
> Why do you need this? Other users don't need that. Do you need to call
> mce_unregister_decode_chain() when the driver gets removed or so?
>
> > +	if (notifier_registered == false) {
>
> This is just silly and should be fixed properly once the issue is understood.

A system can have multiple GPUs, but we want only a single notifier registered with the MCE subsystem.
I will update the comment to state this explicitly.
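
In other words, the guard is just a register-once latch across per-GPU probe calls. A minimal userspace sketch of the intent (names are placeholders; register_chain() stands in for the real MCE decode-chain registration):

```c
#include <stdbool.h>

static bool notifier_registered;	/* latched after the first probe */
static int register_count;		/* counts stand-in registrations */

/* Placeholder for mce_register_decode_chain(&amdgpu_bad_page_nb). */
static void register_chain(void)
{
	register_count++;
}

/* Hypothetical per-GPU probe path: every GPU runs this, but the MCE
 * notifier must be registered exactly once system-wide. */
static void probe_one_gpu(void)
{
	if (!notifier_registered) {
		register_chain();
		notifier_registered = true;
	}
}
```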

Thanks,
Mukul

>
> --
> Regards/Gruss,
> Boris.
>
> https://people.kernel.org/tglx/notes-about-netiquette