Re: [PATCH v22] edac, ras/hw_event.h: use events to handle hw issues

From: Borislav Petkov
Date: Fri May 11 2012 - 13:24:50 EST


On Fri, May 11, 2012 at 09:37:48AM -0300, Mauro Carvalho Chehab wrote:
> > HW_ERR is the "official" prefix used by the MCE code in the kernel.
> > Maybe we can shorten it but it is needed to raise attention when staring
> > at dmesg output.
> >
> > Now, since this tracepoint is not dmesg, we don't need it there at all
> > since we know that trace_mc_error reports memory errors.
>
> IMO, we can get rid of it on trace, keeping it at dmesg.
>
> > "mc_error" is also not needed.
>
> Some name is required there. This parameter affects:
> - the name of the tracepoint function;
> - the TP_printk() output;
> - the trace filter name.
>
> mc_event is probably a good name.

If this tracepoint is going to report memory errors and we drop HW_ERR,
then I guess "mc_error" is the most fitting one because it says what the
message is.

> >> The error count msg ("1 error(s)") could be replaced by "count:1".
> >
> > Is there even a possibility to report more than one error when invoking
> > trace_mc_error once? If not, simply drop the count completely.
>
> Good point. It makes sense to add a count parameter to the error handling core,
> in order to handle "count". AFAIKT, the only two drivers that currently reports
> error counts are sb_edac and i7core_edac.

What does that mean? You report multiple errors with one tracepoint
invocation? How do you report error details then?

If you only report single errors, error count will always be 1 so drop
it.

> With regards to sb_edac, it also needs another logic to be handled by
> the core: channel_mask. This is required to handle lockstep mode and
> mirror mode.

What does "handle lockstep mode" mean and how is that important for me
staring at the tracepoint output.

[ â ]

> > Here's how it should look:
> >
> > kworker/u:6-201 [007] .N.. 161.136624: [Hardware Error]: memory read on memory stick "DIMM_A1" (type: corrected socket:0 mc:0 channel:0 slot:0 rank:1 page:0x586b6e offset:0xa66 grain:32 syndrome:0x0 channel_mask:1)
>
> As already explained: the error_msg is not consistent among the drivers.
>
> So, "memory read on ..." will work fine on sb_edac/i7core_edac but it _won't_
> work on other drivers.
>
> Changing this field on each driver requires deep knowledge of each memory
> controller, and not all datasheets are publicly available.
>
> For example, this is the code for Freescale MPC85xx EDAC driver:
>
> if (err_detect & DDR_EDE_SBE)
> edac_mc_handle_error(HW_EVENT_ERR_CORRECTED, mci,
> pfn, err_addr & ~PAGE_MASK, syndrome,
> row_index, 0, -1,
> mci->ctl_name, "", NULL);
>
> if (err_detect & DDR_EDE_MBE)
> edac_mc_handle_error(HW_EVENT_ERR_UNCORRECTED, mci,
> pfn, err_addr & ~PAGE_MASK, syndrome,
> row_index, 0, -1,
> mci->ctl_name, "", NULL);
>
> It uses a blank value for the error message. putting the error_msg at the beginning,
> as you proposed would print:
>
> [Hardware Error]: on memory stick "DIMM_A1" (type: corrected socket:0 mc:0 channel:0 slot:0 )

that's fine since we know the tracepoint purpose:

"mc_error: on memory stick "DIMM_A1" (type: corrected socket:0 mc:0 channel:0 slot:0)"

is pretty clear to me, IMHO.

> > * count is gone
>
> While count is not properly addressed, I'll keep it on sb_edac/i7core_edac.
>
> The right thing here seems to move it to a core property and increment the error counters
> according with the error count.

Incrementing the error counters shouldn't have anything to do with the
error reporting.

> > * MC-drivers shouldn't say "error" when reporting an error
> > * UE/CE moves into the brackets
>
> It seems that both Tony and me agrees that UE/CE/Fatal should be outside the brackets.

Why?

> > * socket moves earlier in the brackets, and keep the whole deal hierarchical.
>
> Socket doesn't bring a very relevant information. It provides the CPU socket with the core that
> were running the code when an error was generated, not the CPU that were managing the memory.

Then drop it if it's not relevant.

> > * drop "err_code" what is that?
>
> In this case:
> u32 errcode = GET_BITFIELD(m->status, 0, 15);

Either decode it like I do in amd_decode_err_code() or remove it
completely - no naked bit values.

> > * drop second "socket"
> > * drop "area" Area "DRAM" - are there other?
>
> Yes. there are 4 types of area at sb_edac.

And they are?

> > * what is "channel_mask"?
>
> It is a bitmask mask showing all channels affected by an error, on sb_edac.
> Handing it is on my TODO list.

What can the user do with it when it sees it reported?

> > * move "rank" to earlier
>
> Why? This is the least relevant information provided by the driver-specific logic:
>
> snprintf(msg, sizeof(msg),
> "count:%d%s%s area:%s err_code:%04x:%04x socket:%d channel_mask:%ld rank:%d",
> core_err_cnt,
> overflow ? " OVERFLOW" : "",
> (uncorrected_error && recoverable) ? " recoverable" : "",
> area_type,
> mscod, errcode,
> socket,
> channel_mask,
> rank);
>
> Error count, overflow, recoverable bit, etc are more relevant than it, as it actually affects
> the user.
>
> Rank, on the other hand, might only help someone interested on replacing a DRAM chip. Still,
> it is doubtful that this would be used, in practice, as companies that replaces DIMM chip likely
> have some specific equipments to test the DIMMs.
>
> So, I almost dropped it. The only reason for me to not drop is that it allows me to compare
> the driver results with some other testing tools I'm using on driver's development.

Then drop it and keep a local version for your testing needs only.

--
Regards/Gruss,
Boris.

Advanced Micro Devices GmbH
Einsteinring 24, 85609 Dornach
GM: Alberto Bozzo
Reg: Dornach, Landkreis Muenchen
HRB Nr. 43632 WEEE Registernr: 129 19551
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/