Hello Shuai,
On Wed, Jul 16, 2025 at 11:04:28AM +0800, Shuai Xue wrote:
My plan with this patch is to have a counter for hardware errors that
would be exposed to the crashdump. That way, post-mortem analysis tooling
can easily check whether hardware errors occurred and look up the RAS
information in the right databases, in case they look like a smoking gun.
I see your point. But does using a single ghes_recovered_errors counter
to track all corrected and non-fatal errors for CPU, memory, and PCIe
really help?
It provides a quick indication that hardware issues have occurred, which
can prompt the operator to investigate further via RAS events.
That said, Tony proposed a more robust approach: categorizing and
tracking errors by their source. This would involve maintaining separate
counters for each source, with one counter per enum value:
	enum recovered_error_sources {
		ERR_GHES,
		ERR_MCE,
		ERR_AER,
		...
		ERR_NUM_SOURCES
	};
See more at: https://lore.kernel.org/all/aHWC-J851eaHa_Au@agluck-desk3/
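To make the idea concrete, here is a rough user-space sketch of how
per-source counters indexed by that enum could look. It is only an
illustration, not Tony's patch or actual kernel code: the names
recovered_error_counts, record_recovered_error() and
recovered_error_count() are hypothetical, C11 atomics stand in for
whatever the kernel would use, and the elided enum entries are left out.

```c
#include <stdatomic.h>

/* Sources named in the proposal; the elided entries are omitted here. */
enum recovered_error_sources {
	ERR_GHES,
	ERR_MCE,
	ERR_AER,
	ERR_NUM_SOURCES
};

/*
 * Hypothetical storage: one counter per source. Static storage
 * duration means every counter starts at zero.
 */
static atomic_uint recovered_error_counts[ERR_NUM_SOURCES];

/* Hypothetical helper: bump the counter for one source. */
static void record_recovered_error(enum recovered_error_sources src)
{
	/* Ignore out-of-range values rather than write past the array. */
	if (src >= 0 && src < ERR_NUM_SOURCES)
		atomic_fetch_add(&recovered_error_counts[src], 1);
}

/* Hypothetical helper: read back the count for one source. */
static unsigned int recovered_error_count(enum recovered_error_sources src)
{
	return atomic_load(&recovered_error_counts[src]);
}
```

With a layout like this, each handler would call
record_recovered_error() with its own source, and a post-mortem tool
would only need to locate one array in the dump to see which subsystem
reported errors.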
Do you think this would help you by any chance?
Thanks
--breno