Re: [PATCH] ghes: Track number of recovered hardware errors

From: Shuai Xue
Date: Wed Jul 16 2025 - 23:04:21 EST

Next message: Jinlong Mao: "Re: [PATCH v8 2/2] coresight: Add label sysfs node support"
Previous message: Sharan Kumar Muthu Saravanan: "[PATCH] ALSA: hda/realtek: Enable Mute LED on HP OMEN 16 Laptop xd000xx"
In reply to: Breno Leitao: "Re: [PATCH] ghes: Track number of recovered hardware errors"
Next in thread: Breno Leitao: "Re: [PATCH] ghes: Track number of recovered hardware errors"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

On 2025/7/16 20:42, Breno Leitao wrote:

Hello Shuai,

On Wed, Jul 16, 2025 at 11:04:28AM +0800, Shuai Xue wrote:

My plan with this patch is to have a counter for hardware errors that
would be exposed to the crashdump. That way, post-mortem analysis
tooling can easily check whether hardware errors occurred and query RAS
information in the right databases, in case they look like a smoking gun.

I see your point. But does using a single ghes_recovered_errors counter
to track all corrected and non-fatal errors for CPU, memory, and PCIe
really help?

It provides a quick indication that hardware issues have occurred, which
can prompt the operator to investigate further via RAS events.

That said, Tony proposed a more robust approach: categorizing and
tracking errors by their source. This would involve maintaining separate
counters for each source, using one counter per enum value:

enum recovered_error_sources {
	ERR_GHES,
	ERR_MCE,
	ERR_AER,
	...
	ERR_NUM_SOURCES
};

See more at: https://lore.kernel.org/all/aHWC-J851eaHa_Au@agluck-desk3/
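
For illustration only, the per-source idea could be sketched as below.
This is a userspace approximation using C11 atomics, not code from the
patch or from Tony's proposal: the array recovered_errors and the helpers
record_recovered_error() and recovered_error_count() are hypothetical
names, and the elided enum entries are simply left out.

```c
#include <stdatomic.h>

/* Sources named in the thread; the elided entries ("...") are omitted. */
enum recovered_error_sources {
	ERR_GHES,
	ERR_MCE,
	ERR_AER,
	ERR_NUM_SOURCES
};

/* Hypothetical table: one counter per source, zero-initialized. */
static atomic_uint recovered_errors[ERR_NUM_SOURCES];

/* Hypothetical helper: bump the counter for the reporting source. */
static void record_recovered_error(enum recovered_error_sources src)
{
	atomic_fetch_add(&recovered_errors[src], 1);
}

/* Hypothetical helper: read one counter, e.g. from analysis tooling. */
static unsigned int recovered_error_count(enum recovered_error_sources src)
{
	return atomic_load(&recovered_errors[src]);
}
```

With a layout like this, tooling could tell at a glance which subsystem
reported errors before digging into the matching RAS records.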

Do you think this would help you by any chance?

Thanks
--breno

Personally, I think this approach would be more helpful. Additionally, I
suggest not mixing CEs (Correctable Errors) and UEs (Uncorrectable
Errors) together. This is especially important for memory errors, as CEs
occur much more frequently than UEs, but their impact is much smaller.
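
As a rough sketch of what keeping CEs and UEs apart might look like, the
counters could be indexed by both source and severity. Everything below
is an assumption made for illustration (the enum names, the error_counts
table, and both helpers are hypothetical), written as plain userspace C
with C11 atomics rather than kernel code.

```c
#include <stdatomic.h>

/* Hypothetical source list, mirroring the sources named in the thread. */
enum error_source {
	SRC_GHES,
	SRC_MCE,
	SRC_AER,
	SRC_NUM
};

/* Hypothetical severity split: correctable vs. uncorrectable. */
enum error_severity {
	SEV_CE,
	SEV_UE,
	SEV_NUM
};

/* One counter per (source, severity) pair, zero-initialized. */
static atomic_ulong error_counts[SRC_NUM][SEV_NUM];

/* Count one error without mixing CEs and UEs. */
static void record_error(enum error_source src, enum error_severity sev)
{
	atomic_fetch_add(&error_counts[src][sev], 1);
}

/* Read back a single (source, severity) counter. */
static unsigned long error_count(enum error_source src,
				 enum error_severity sev)
{
	return atomic_load(&error_counts[src][sev]);
}
```

A burst of memory CEs would then leave the UE counters untouched, so a
non-zero UE count would still stand out.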

Thanks.
Shuai
