Re: [PATCH v2 1/2] x86/mce: Extend AMD severity grading function with new types of errors

From: Carlos Bilbao
Date: Tue Apr 05 2022 - 21:30:50 EST


On 4/5/2022 12:18 PM, Yazen Ghannam wrote:
> On Thu, Mar 31, 2022 at 11:38:49AM -0500, Carlos Bilbao wrote:
>> The MCE handler needs to understand the severity of the machine errors to
>> act accordingly. In the case of AMD, very few errors are covered in the
>> grading logic.
>>
>> Extend the MCEs severity grading of AMD to cover new types of machine
>> errors.
>>
>
> This patch does not add new types of machine errors. Please update the commit
> message (and cover letter) to be consistent with changes made between patch
> revisions.
>

For the commit message and cover letter, I'm thinking of "cover error cases
not previously considered".

>> Signed-off-by: Carlos Bilbao <carlos.bilbao@xxxxxxx>
>> ---
>> arch/x86/kernel/cpu/mce/severity.c | 104 ++++++++++-------------------
>> 1 file changed, 37 insertions(+), 67 deletions(-)
>>
>> diff --git a/arch/x86/kernel/cpu/mce/severity.c b/arch/x86/kernel/cpu/mce/severity.c
>> index 1add86935349..4d52eef21230 100644
>> --- a/arch/x86/kernel/cpu/mce/severity.c
>> +++ b/arch/x86/kernel/cpu/mce/severity.c
>> @@ -301,85 +301,55 @@ static noinstr int error_context(struct mce *m, struct pt_regs *regs)
>> }
>> }
>>
>> -static __always_inline int mce_severity_amd_smca(struct mce *m, enum context err_ctx)
>> -{
>> - u64 mcx_cfg;
>> -
>> - /*
>> - * We need to look at the following bits:
>> - * - "succor" bit (data poisoning support), and
>> - * - TCC bit (Task Context Corrupt)
>> - * in MCi_STATUS to determine error severity.
>> - */
>> - if (!mce_flags.succor)
>> - return MCE_PANIC_SEVERITY;
>> -
>> - mcx_cfg = mce_rdmsrl(MSR_AMD64_SMCA_MCx_CONFIG(m->bank));
>> -
>> - /* TCC (Task context corrupt). If set and if IN_KERNEL, panic. */
>> - if ((mcx_cfg & MCI_CONFIG_MCAX) &&
>> - (m->status & MCI_STATUS_TCC) &&
>> - (err_ctx == IN_KERNEL))
>> - return MCE_PANIC_SEVERITY;
>> -
>> - /* ...otherwise invoke hwpoison handler. */
>> - return MCE_AR_SEVERITY;
>> -}
>> -
>> /*
>> - * See AMD Error Scope Hierarchy table in a newer BKDG. For example
>> - * 49125_15h_Models_30h-3Fh_BKDG.pdf, section "RAS Features"
>> + * See AMD PPR(s) section 3.1 Machine Check Architecture
>
> I don't know that section numbers will be consistent between different PPR
> versions, so having the section name is a good idea. The "Machine Check Error
> Handling" section is what the severity grading function is based on.
>

Ack

>> */
>> static noinstr int mce_severity_amd(struct mce *m, struct pt_regs *regs, char **msg, bool is_excp)
>> {
>> - enum context ctx = error_context(m, regs);
>> + int ret;
>> +
>> + /*
>> + * Default return value: Action required, the error must be handled
>> + * immediately.
>> + */
>> + ret = MCE_AR_SEVERITY;
>>
>> /* Processor Context Corrupt, no need to fumble too much, die! */
>> - if (m->status & MCI_STATUS_PCC)
>> - return MCE_PANIC_SEVERITY;
>> + if (m->status & MCI_STATUS_PCC) {
>> + ret = MCE_PANIC_SEVERITY;
>> + goto amd_severity;
>> + }
>>
>> - if (m->status & MCI_STATUS_UC) {
>> + /*
>> + * Evaluate the severity of deferred errors for AMD systems, for which only
>> + * scrub error is interesting to notify an action requirement. The poll
>> + * handler catches deferred errors and adds to mce_ring so memory-failure
>> + * can take recovery actions.
>> + */
>
> I think this whole comment can be dropped. The "scrub error" part is not
> correct. The polling function may find deferred errors, but they are most
> likely to be seen by the deferred error interrupt handler on modern AMD
> systems. The "mce_ring" was removed a long time ago (in v4.3).
>

Ack

>> + if (m->status & MCI_STATUS_DEFERRED) {
>> + ret = MCE_DEFERRED_SEVERITY;
>> + goto amd_severity;
>> + }
>>
>> - if (ctx == IN_KERNEL)
>> - return MCE_PANIC_SEVERITY;
>> + /* If the UC bit is not set, the error has been corrected */
>
> This comment is not true. Deferred errors are an example of an uncorrectable
> error where UC is not set.
>

Ack

>> + if (!(m->status & MCI_STATUS_UC)) {
>> + ret = MCE_KEEP_SEVERITY;
>> + goto amd_severity;
>> + }
>>
>> - /*
>> - * On older systems where overflow_recov flag is not present, we
>> - * should simply panic if an error overflow occurs. If
>> - * overflow_recov flag is present and set, then software can try
>> - * to at least kill process to prolong system operation.
>> - */
>> - if (mce_flags.overflow_recov) {
>> - if (mce_flags.smca)
>> - return mce_severity_amd_smca(m, ctx);
>> -
>> - /* kill current process */
>> - return MCE_AR_SEVERITY;
>> - } else {
>> - /* at least one error was not logged */
>> - if (m->status & MCI_STATUS_OVER)
>> - return MCE_PANIC_SEVERITY;
>> - }
>> -
>> - /*
>> - * For any other case, return MCE_UC_SEVERITY so that we log the
>> - * error and exit #MC handler.
>> - */
>> - return MCE_UC_SEVERITY;
>> + if (((m->status & MCI_STATUS_OVER) && !mce_flags.overflow_recov)
>> + || !mce_flags.succor) {
>
> I appreciate merging two cases together that have the same result. But I feel
> keeping them separate may be easier to follow. They can also each have their
> own code comments. Or keep them together and explain each within the same
> comment block.
>

I will divide these two cases.
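
Roughly what the split could look like, as an untested userspace sketch
rather than the actual hunk for the next revision. The sketch_grade_uc()
helper, its bool parameters (standing in for mce_flags.overflow_recov and
mce_flags.succor) and the SKETCH_* values are placeholders invented for
illustration; the MCI_STATUS_OVER define is a local stand-in so the
snippet builds outside the kernel tree.

```c
#include <stdbool.h>
#include <stdint.h>

/* Local stand-in so the sketch builds outside the kernel (assumed bit). */
#define MCI_STATUS_OVER		(1ULL << 62)

/* Hypothetical severity values, not the kernel's enum. */
#define SKETCH_AR_SEVERITY	0
#define SKETCH_PANIC_SEVERITY	1

/* Grades an uncorrected error; each panic case gets its own check. */
static int sketch_grade_uc(uint64_t status, bool overflow_recov, bool succor)
{
	/* At least one error was not logged and overflow recovery is absent. */
	if ((status & MCI_STATUS_OVER) && !overflow_recov)
		return SKETCH_PANIC_SEVERITY;

	/* No data poisoning support ("succor"). */
	if (!succor)
		return SKETCH_PANIC_SEVERITY;

	/* Otherwise: action required. */
	return SKETCH_AR_SEVERITY;
}
```

Keeping each condition on one line with its own comment should also avoid
wrapping the expression, and with it the placement of "||" at the start
of a line.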

> Also, there's a checkpatch "CHECK" here. You'll see it when using the
> "--strict" flag with checkpatch.
>
>> + ret = MCE_PANIC_SEVERITY;
>> + goto amd_severity;
>> }
>>
>> - /*
>> - * deferred error: poll handler catches these and adds to mce_ring so
>> - * memory-failure can take recovery actions.
>> - */
>> - if (m->status & MCI_STATUS_DEFERRED)
>> - return MCE_DEFERRED_SEVERITY;
>> + if (error_context(m, regs) == IN_KERNEL) {
>> + ret = MCE_PANIC_SEVERITY;
>> + }
>
> Braces aren't needed here. The previous comment about braces was for when
> there's a block of "if/else-if/else" statements. A single "if" statement with
> a single line doesn't need braces.
>

Ack

>>
>> - /*
>> - * corrected error: poll handler catches these and passes responsibility
>> - * of decoding the error to EDAC
>> - */
>> - return MCE_KEEP_SEVERITY;
>> +amd_severity:
>
> This label doesn't look right to me. Maybe I'm too used to seeing "out" and
> "err" labels.
>
> Please see "Documentation/process/coding-style.rst" section (7) "Centralized
> exiting of functions".
>
> Maybe something like "out_ret_severity" to indicate the code is going to exit
> and return the severity. Or maybe just use "out"? Maybe others have thoughts
> on this.
>

"out_amd_severity" sounds good to me.
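
To make the overall shape concrete, here is an untested userspace sketch
of the grading order with the single exit label. It is not the kernel
function: struct sketch_mce_in, sketch_severity_amd() and the SEV_*
values are made up for illustration, the in_kernel field stands in for
"error_context(m, regs) == IN_KERNEL", and the MCI_STATUS_* defines are
local stand-ins with assumed bit positions.

```c
#include <stdbool.h>
#include <stdint.h>

/* Local stand-ins so the sketch builds outside the kernel (assumed bits). */
#define MCI_STATUS_OVER		(1ULL << 62)
#define MCI_STATUS_UC		(1ULL << 61)
#define MCI_STATUS_PCC		(1ULL << 57)
#define MCI_STATUS_DEFERRED	(1ULL << 44)

/* Hypothetical severity values, not the kernel's enum. */
enum sketch_sev { SEV_DEFERRED, SEV_KEEP, SEV_AR, SEV_PANIC };

/* Hypothetical bundle of the inputs the grading looks at. */
struct sketch_mce_in {
	uint64_t status;
	bool overflow_recov;
	bool succor;
	bool in_kernel;
};

static int sketch_severity_amd(const struct sketch_mce_in *in)
{
	/* Default: action required. */
	int ret = SEV_AR;

	/* Processor Context Corrupt. */
	if (in->status & MCI_STATUS_PCC) {
		ret = SEV_PANIC;
		goto out_amd_severity;
	}

	if (in->status & MCI_STATUS_DEFERRED) {
		ret = SEV_DEFERRED;
		goto out_amd_severity;
	}

	/* Neither deferred nor UC: keep the record, nothing to act on here. */
	if (!(in->status & MCI_STATUS_UC)) {
		ret = SEV_KEEP;
		goto out_amd_severity;
	}

	/* At least one error was not logged and overflow recovery is absent. */
	if ((in->status & MCI_STATUS_OVER) && !in->overflow_recov) {
		ret = SEV_PANIC;
		goto out_amd_severity;
	}

	/* No data poisoning support. */
	if (!in->succor) {
		ret = SEV_PANIC;
		goto out_amd_severity;
	}

	/* Single statement, so no braces. */
	if (in->in_kernel)
		ret = SEV_PANIC;

out_amd_severity:
	return ret;
}
```

With nothing to clean up before returning, plain early returns would
behave the same; the label only matters if the exit path later grows
shared work.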

> Thanks,
> Yazen

I will send an updated patchset.

Thanks,
Carlos