Re: [PATCH] ghes: Track number of recovered hardware errors

From: Breno Leitao
Date: Wed Jul 16 2025 - 08:43:12 EST


Hello Shuai,

On Wed, Jul 16, 2025 at 11:04:28AM +0800, Shuai Xue wrote:
> > My plan with this patch is to have a counter for hardware errors that
> > would be exposed to the crashdump. So, post-mortem analysis tooling can
> > easily check whether there were hardware errors and query RAS information
> > in the right databases, in case they look like a smoking gun.
>
> I see your point. But does using a single ghes_recovered_errors counter
> to track all corrected and non-fatal errors for CPU, memory, and PCIe
> really help?

It provides a quick indication that hardware issues have occurred, which
can prompt the operator to investigate further via RAS events.

That said, Tony proposed a more robust approach: categorizing and
tracking errors by their source. This would involve maintaining a
separate counter for each source, indexed by an enum:

	enum recovered_error_sources {
		ERR_GHES,
		ERR_MCE,
		ERR_AER,
		...
		ERR_NUM_SOURCES
	};

See more at: https://lore.kernel.org/all/aHWC-J851eaHa_Au@agluck-desk3/
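
To make the idea concrete, here is a rough userspace sketch of how
per-source counters could hang off that enum. It is only an illustration,
not the proposed kernel code: it uses C11 atomics instead of kernel
primitives, it lists only the three sources named above, and the names
recovered_errors, recovered_error_inc() and recovered_error_count() are
hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>

/* Sources named in the proposal; the remaining ones are elided there. */
enum recovered_error_sources {
	ERR_GHES,
	ERR_MCE,
	ERR_AER,
	ERR_NUM_SOURCES
};

/* Hypothetical: one counter per source, zero-initialized (static storage). */
static atomic_uint recovered_errors[ERR_NUM_SOURCES];

/* Hypothetical helper: record one recovered error for the given source. */
static void recovered_error_inc(enum recovered_error_sources src)
{
	/* Silently ignore out-of-range sources rather than corrupt memory. */
	if (src < ERR_NUM_SOURCES)
		atomic_fetch_add_explicit(&recovered_errors[src], 1,
					  memory_order_relaxed);
}

/* Hypothetical helper: read the current count for the given source. */
static unsigned int recovered_error_count(enum recovered_error_sources src)
{
	assert(src < ERR_NUM_SOURCES);
	return atomic_load_explicit(&recovered_errors[src],
				    memory_order_relaxed);
}
```

With something along these lines, each handler would bump its own slot
(e.g. recovered_error_inc(ERR_GHES) from the GHES path), and the whole
array could be read back from the crashdump to see which subsystem
reported recovered errors.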

Do you think this would help you by any chance?

Thanks
--breno