Re: [PATCH] ghes: Track number of recovered hardware errors

From: Shuai Xue
Date: Wed Jul 16 2025 - 23:04:21 EST




On 2025/7/16 20:42, Breno Leitao wrote:
Hello Shuai,

On Wed, Jul 16, 2025 at 11:04:28AM +0800, Shuai Xue wrote:
My plan with this patch is to have a counter for hardware errors that
would be exposed to the crashdump. That way, post-mortem analysis tooling
can easily check whether hardware errors occurred and look up RAS
information in the right databases, in case they look like a smoking gun.

I see your point. But does using a single ghes_recovered_errors counter
to track all corrected and non-fatal errors for CPU, memory, and PCIe
really help?

It provides a quick indication that hardware issues have occurred, which
can prompt the operator to investigate further via RAS events.

That said, Tony proposed a more robust approach: categorizing and
tracking errors by their source. This would involve maintaining separate
counters for each source, with one counter per enum value:

	enum recovered_error_sources {
		ERR_GHES,
		ERR_MCE,
		ERR_AER,
		...
		ERR_NUM_SOURCES
	};

See more at: https://lore.kernel.org/all/aHWC-J851eaHa_Au@agluck-desk3/

Do you think this would help you by any chance?

Thanks
--breno


Personally, I think this approach would be more helpful. Additionally, I
suggest not mixing CEs (Correctable Errors) and UEs (Uncorrectable
Errors) together. This is especially important for memory errors, as CEs
occur much more frequently than UEs, but their impact is much smaller.
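
As a rough illustration of both ideas combined (per-source counters, with
CEs and UEs kept apart), here is a minimal userspace sketch, not kernel
code. Only ERR_GHES, ERR_MCE, ERR_AER, and ERR_NUM_SOURCES come from the
proposal quoted above; the severity enum, the counter array, and the
helpers recovered_error_inc() and recovered_error_read() are hypothetical
names made up for this example. A kernel version would presumably use the
kernel's own atomic types rather than C11 <stdatomic.h>.

```c
#include <assert.h>
#include <stdatomic.h>

/* Sources named in the thread; the real list may well be longer. */
enum recovered_error_sources {
	ERR_GHES,
	ERR_MCE,
	ERR_AER,
	ERR_NUM_SOURCES
};

/* Hypothetical severity split; not part of the posted patch. */
enum recovered_error_severity {
	ERR_SEV_CE,	/* corrected error */
	ERR_SEV_UE,	/* uncorrected but recovered error */
	ERR_NUM_SEVERITIES
};

/* One counter per (source, severity) pair; static storage starts at zero. */
static atomic_uint recovered_errors[ERR_NUM_SOURCES][ERR_NUM_SEVERITIES];

/* Hypothetical helper: account for one recovered error. */
static void recovered_error_inc(enum recovered_error_sources src,
				enum recovered_error_severity sev)
{
	assert(src < ERR_NUM_SOURCES && sev < ERR_NUM_SEVERITIES);
	atomic_fetch_add(&recovered_errors[src][sev], 1);
}

/* Hypothetical helper: read the current count for one pair. */
static unsigned int recovered_error_read(enum recovered_error_sources src,
					 enum recovered_error_severity sev)
{
	assert(src < ERR_NUM_SOURCES && sev < ERR_NUM_SEVERITIES);
	return atomic_load(&recovered_errors[src][sev]);
}
```

With a layout like this, a burst of memory CEs only moves the CE counter
for its source, so a single recovered UE stays visible to post-mortem
tooling instead of being buried in the total.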

Thanks.
Shuai