Re: [PATCH] drm/amdgpu: Register bad page handler for Aldebaran

From: Alex Deucher
Date: Thu May 13 2021 - 10:33:11 EST


On Thu, May 13, 2021 at 10:30 AM Borislav Petkov <bp@xxxxxxxxx> wrote:
>
> On Thu, May 13, 2021 at 10:17:47AM -0400, Alex Deucher wrote:
> > The bad pages are recorded in an EEPROM on the board, and the next time
> > the driver loads, it reads the EEPROM so that it can reserve the bad
> > pages at init time and they don't get used again.
>
> And that works automagically on the next boot? Because that sounds like
> the right thing to do.

Yes, or driver reload, suspend/resume, etc.

>
> So practically, what happens to a GPU in such a case where the VRAM
> starts going bad? It might get exhausted eventually and the driver will
> say something along the lines of:
>
> "VRAM bad pages: 80%, consider replacing the GPU. It is currently
> operating with degraded performance."
>
> or so?

Right. The sysadmin can query the bad page count and decide when to
retire the card.
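
As a rough illustration of that kind of query, here is a minimal Python
sketch that counts entries in a bad page listing and applies a retirement
threshold. It assumes the listing has one page per line in the form
"pfn : page size : flag", with flags R (reserved), P (pending) and
F (failed to reserve), which is my understanding of what amdgpu exposes
through its RAS sysfs interface. The exact sysfs path is not given in this
thread, so the sketch takes the listing as a string. should_retire() and
its threshold are hypothetical policy, not something the driver provides.

```python
from collections import Counter


def count_bad_pages(listing: str) -> Counter:
    """Count bad-page entries per flag in a bad-page listing.

    Assumed line format: "<pfn> : <page size> : <flag>", for example
    "0x00000001 : 0x00001000 : R".
    """
    counts = Counter()
    for line in listing.splitlines():
        fields = [field.strip() for field in line.split(":")]
        if len(fields) != 3:
            continue  # skip blank or malformed lines
        counts[fields[2]] += 1
    return counts


def should_retire(listing: str, threshold: int) -> bool:
    """Hypothetical policy: retire once total bad pages reach threshold."""
    return sum(count_bad_pages(listing).values()) >= threshold
```

An admin script could read the listing from the device's sysfs file, pass
the text to count_bad_pages(), and alert once should_retire() returns
True for a site-specific threshold.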

>
> Yap, from a RAS perspective, that makes good sense as you're prolonging
> the life of the component while it still remains as operational as it
> can, and the only user interaction you need is her/him replacing it.
>
> Sounds good.

Yes. That's the idea.

Alex


>
> Thx.
>
> --
> Regards/Gruss,
> Boris.
>
> https://people.kernel.org/tglx/notes-about-netiquette