Re: [PATCH v29] RAS: Add a tracepoint for reporting memory controller events

From: Borislav Petkov
Date: Wed Jun 06 2012 - 08:52:51 EST


On Wed, Jun 06, 2012 at 07:33:19AM -0300, Mauro Carvalho Chehab wrote:
> RAS: Add a tracepoint for reporting memory controller events
>
> From: Mauro Carvalho Chehab <mchehab@xxxxxxxxxx>

[ ... ]

> The tracepoint printk will be displayed like:
>
> mc_event: [quant] (Corrected|Uncorrected|Fatal) error:[error msg] on memory stick [label] ([location] [edac_mc detail] [driver_d$
>
> Where:
> [quant] is the quantity of errors
> [error msg] is the driver-specific error message
> (e. g. "memory read", "bus error", ...);
> [location] is the location in terms of memory controller and
> branch/channel/slot, channel/slot or csrow/channel;
> [label] is the memory stick label;
> [edac_mc detail] describes the address location of the error
> and the syndrome;
> [driver detail] is driver-specific error message details,
> when needed/provided (e. g. "area:DMA", ...)
>
> For example:
>
> mc_event: 1 Corrected error:memory read on memory stick DIMM_1A (mc:0 location:0:0:0 page:0x586b6e offset:0xa66 grain:32 syndrome:0x0 area:DMA)
>
> Of course, any userspace tools meant to handle errors should not parse
> the above data. They should, instead, use the binary fields provided by
> the tracepoint, mapping them directly into their Management Information
> Base.
>
> NOTE: The original patch was providing an additional mechanism for
> MCA-based trace events that also contained MCA error register data.
> However, as no agreement was reached so far for the MCA-based trace
> events, for now, let's add events only for memory errors.
> A later patch is planned to change the tracepoint for those types
> of event.
>
> Cc: Aristeu Rozanski <arozansk@xxxxxxxxxx>
> Cc: Doug Thompson <norsk5@xxxxxxxxx>
> Cc: Steven Rostedt <rostedt@xxxxxxxxxxx>
> Cc: Frederic Weisbecker <fweisbec@xxxxxxxxx>
> Cc: Ingo Molnar <mingo@xxxxxxxxxx>
> Signed-off-by: Mauro Carvalho Chehab <mchehab@xxxxxxxxxx>

Ok, this is starting to shape up. Here's the output on my box:

mcegen.py-3009 [008] .N.. 144.149649: mc_event: 1 Corrected error: amd64_edac on unknown memory (mc:0 location:3:1:-1 address:0x000007ba grain:2 syndrome:0x0000ac71)
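
For illustration only (not part of the patch): a minimal Python sketch of
how the fields listed in the quoted changelog compose the human-readable
line. The function name format_mc_event and its parameters are
hypothetical; in the kernel the string comes from the tracepoint's
TP_printk() format. The sketch follows the changelog text, so it does not
reproduce details of the output above such as the space after "error:" or
the "unknown memory" fallback.

```python
SEVERITIES = ("Corrected", "Uncorrected", "Fatal")


def format_mc_event(count, severity, msg, label, location, detail,
                    driver_detail=""):
    """Build the mc_event text line described in the changelog.

    All parameter names are illustrative assumptions, not kernel API.
    """
    if severity not in SEVERITIES:
        raise ValueError("unknown severity: %r" % (severity,))
    # Location, edac_mc detail and optional driver detail share the
    # parenthesised tail, separated by single spaces.
    tail = " ".join(part for part in (location, detail, driver_detail) if part)
    return "mc_event: %d %s error:%s on memory stick %s (%s)" % (
        count, severity, msg, label, tail)


# Field values taken from the "For example" line in the changelog.
line = format_mc_event(
    1, "Corrected", "memory read", "DIMM_1A",
    "mc:0 location:0:0:0",
    "page:0x586b6e offset:0xa66 grain:32 syndrome:0x0",
    "area:DMA",
)
print(line)
```

As the changelog says, tools should read the binary fields rather than
parse this string; the sketch only shows how the pieces line up.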

Tony, any objections?

--
Regards/Gruss,
Boris.

Advanced Micro Devices GmbH
Einsteinring 24, 85609 Dornach
GM: Alberto Bozzo
Reg: Dornach, Landkreis Muenchen
HRB Nr. 43632 WEEE Registernr: 129 19551