Re: [PATCH v2 1/4] EDAC/mce_amd: Remove SMCA Extended Error code descriptions

From: M K, Muralidhara
Date: Thu Oct 26 2023 - 05:42:39 EST


Hi Boris,

On 10/26/2023 12:38 AM, Borislav Petkov wrote:

On Wed, Oct 25, 2023 at 05:14:52AM +0000, Muralidhara M K wrote:
The SMCA error decoding already exists in rasdaemon, and decoding of future
banks is supported by the patches below, which were merged into rasdaemon.
https://github.com/mchehab/rasdaemon/commit/1f74a59ee33b7448b00d7ba13d5ecd4918b9853c rasdaemon: Add new MA_LLC, USR_DP, and USR_CP bank types
https://github.com/mchehab/rasdaemon/commit/2d15882a0cbfce0b905039bebc811ac8311cd739 rasdaemon: Handle reassigned bit definitions for UMC bank


I'm still missing here the exact steps a user needs to do in order to
decode such an error.

Please inject an error, catch the error message and show me how one is
supposed to decode it with rasdaemon in case the daemon is not running
while the error happens or the error is fatal and the machine doesn't
even get to run userspace.

If that is not possible with rasdaemon yet, then this patch should not
remove the error descriptions but limit them only to the families for
which they're valid.

Bottom line is, I don't want to have the situation mcelog is in where
decoding errors with it is a total disaster.

IOW, I'd like error decoding on AMD to always work and be trivially easy
to do.


I injected an error; the dmesg log is below:

[ 3991.560180] mce: [Hardware Error]: Machine check events logged
[ 3991.560195] [Hardware Error]: Corrected error, no action required.
[ 3991.567119] [Hardware Error]: CPU:2 (19:90:0) MC25_STATUS[Over|CE|MiscV|AddrV|-|-|SyndV|CECC|-|-|-]: 0xdc2040000000011b
[ 3991.579205] [Hardware Error]: Error Addr: 0x0000000000000040
[ 3991.585546] [Hardware Error]: PPIN: 0xabcdef0000000000
[ 3991.591302] [Hardware Error]: IPID: 0x0000009600792f00, Syndrome: 0x000000000a000000
[ 3991.599977] [Hardware Error]: Unified Memory Controller Ext. Error Code: 0
[ 3991.599985] [Hardware Error]: cache level: L3/GEN, tx: GEN, mem-tx: RD

In the log above, "Ext. Error Code: 0" shows that only the numeric error code is printed, because this patch removes the error strings.
The user can refer to the PPR to check what the error code means,
or the rasdaemon tool can print the error string for that particular error code.
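
For illustration only (this is not part of the patch or of rasdaemon), the MC25_STATUS value in the log above can be unpacked by hand. The sketch below assumes the bit positions used by the Linux kernel's MCI_STATUS_* definitions for AMD SMCA systems and the kernel's memory-error code layout; the names decode_status, STATUS_FLAGS, RRRR, TT and LL are made up for this example, and the bit positions should be checked against the PPR for the family/model in question.

```python
# Hypothetical helper: unpack an SMCA MCA_STATUS value into flags and codes.
# Assumed bit positions (verify against the PPR before relying on them).
STATUS_FLAGS = {
    63: "Val", 62: "Over", 61: "UC", 60: "En", 59: "MiscV", 58: "AddrV",
    57: "PCC", 55: "TCC", 53: "SyndV", 46: "CECC", 45: "UECC",
    44: "Deferred", 43: "Poison",
}

# Memory-error sub-fields of the 16-bit error code (0000 0001 RRRR TTLL).
RRRR = ["GEN", "RD", "WR", "DRD", "DWR", "IRD", "PRF", "EV", "SNP"]
TT = ["INSN", "DATA", "GEN", "RESV"]
LL = ["RESV", "L1", "L2", "L3/GEN"]


def decode_status(status):
    """Return set flags, extended error code and error code of a status value."""
    flags = [name
             for bit, name in sorted(STATUS_FLAGS.items(), reverse=True)
             if (status >> bit) & 1]
    xec = (status >> 16) & 0x3F      # extended error code, bits 21:16 on SMCA
    ec = status & 0xFFFF             # error code, bits 15:0
    result = {"flags": flags, "xec": xec, "ec": ec}
    if (ec & 0xFF00) == 0x0100:      # memory error class
        r = (ec >> 4) & 0xF
        result["mem_tx"] = RRRR[r] if r < len(RRRR) else "-"
        result["tx"] = TT[(ec >> 2) & 0x3]
        result["level"] = LL[ec & 0x3]
    return result


if __name__ == "__main__":
    # Status value taken from the dmesg log in this mail.
    print(decode_status(0xDC2040000000011B))
```

With the value from the log this yields extended error code 0 and the same "L3/GEN", "GEN", "RD" fields that dmesg prints. It does not give the meaning of extended error code 0, which still has to come from the PPR or from rasdaemon.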



I then ran rasdaemon:

rasdaemon: Listening to events for cpus 0 to 191
<...>-1420 [002] .... 0.000399 mce_record 2023-10-26 04:28:37 -0500 Unified Memory Controller (bank=25), status= dc2040000000011b, Corrected error, no action required., mci=Error_overflow CECC, mca=DRAM On Die ECC error. Ext Err Code: 0 Memory Error 'mem-tx: generic read, tx: generic, level: L3/generic', memory_die_id=0, cpu_type= AMD Scalable MCA, cpu= 2, socketid= 0, misc= d01a000201000000, addr= 40, synd= a000000, ipid= 9600792f00, mcgstatus=0, mcgcap= 140, apicid= 4

The log shows "DRAM On Die ECC error", which is the string for Ext Err Code: 0.
So the error strings are maintained in rasdaemon.
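
As a rough sketch of what such a lookup amounts to, the example below maps the raw IPID and STATUS values from the dmesg log to a description string. It is not rasdaemon's code. It assumes the SMCA IPID layout used by the kernel (HardwareID in bits 43:32, McaType in bits 63:48), the function and table names are hypothetical, and the table holds only the single string confirmed by the rasdaemon output in this mail, keyed by family and model because the strings are only valid per family.

```python
# Hypothetical offline lookup: (family, model, hwid, mcatype, xec) -> string.
# Only the one entry seen in this thread is filled in; a real table would
# have to be populated from the PPR for each family/model.
ERROR_STRINGS = {
    (0x19, 0x90, 0x96, 0x0, 0x0): "DRAM On Die ECC error",
}


def decode_ipid(ipid):
    """Split an SMCA MCA_IPID value into its fields (assumed layout)."""
    high = ipid >> 32
    return {
        "hwid": high & 0xFFF,             # HardwareID, bits 43:32
        "mcatype": (high >> 16) & 0xFFFF, # McaType, bits 63:48
        "instance_id": ipid & 0xFFFFFFFF, # InstanceId, bits 31:0
    }


def describe(family, model, ipid, status):
    """Return the error string, or fall back to the bare extended error code."""
    fields = decode_ipid(ipid)
    xec = (status >> 16) & 0x3F           # extended error code on SMCA
    key = (family, model, fields["hwid"], fields["mcatype"], xec)
    return ERROR_STRINGS.get(key, "Ext. Error Code: %d (see PPR)" % xec)


if __name__ == "__main__":
    # "(19:90:0)", IPID and STATUS as printed in the dmesg log above.
    print(describe(0x19, 0x90, 0x0000009600792F00, 0xDC2040000000011B))
```

Keying on family and model keeps a string from being applied to a system where the same extended error code means something else. Unknown combinations fall back to the bare code, which matches what the kernel prints after this patch.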


Thx.

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette