Re: [PATCH v22] edac, ras/hw_event.h: use events to handle hw issues

From: Mauro Carvalho Chehab
Date: Wed May 16 2012 - 14:08:43 EST


Em 16-05-2012 12:47, Borislav Petkov escreveu:
> On Wed, May 16, 2012 at 12:16:35PM -0300, Mauro Carvalho Chehab wrote:
>>> This doesn't answer my question. My question was: "why can't 'detail'
>>> and 'driver_detail' be a single parameter, i.e. 'detail' and this way
>>> solve both pretty printing and getting binary data problems?
>>
>> This is the 24th version of this very same patch...
>
> This would have been the case if you'd split your patches and sent them
> in 10-15 patches sets like normal people, _after_ gathering all review feedback.

The first version had only 3 patches. Patches were being added there due to
the huge delay to get them reviewed/accepted.

> But, you wanted to fix EDAC and the whole world while at it and besides,
> your patches caused boot errors here and there.
>
> Then, you went and rebased the whole patchset after me reviewing one or
> two and incremented for that rebase the version number.

Because you requested that by not accepting incremental patches at the end
of the series.

> So v24 means nothing to me - you might just as well use it for your
> internal tracking.
>
>> In summary: all edac messages provide "detail" as this contains the
>> error location in terms of channel/slot. So, any MIB for EDAC could
>> handle those parameters properly. With regards to driver_detail, this
>> have per-driver details. So, per-driver MIB is required for them, if
>> some userspace program wants to properly store that information.
>>
>> Merging those two separate fields together only makes harder for
>> userspace to store the error detail information on their MIB.
>
> There's that MIB crap again.
>
> And it doesn't make it harder for anything because in userspace you can
> do everything with those strings, cut them, replace them, whatever your
> heart desires, even store the correct error detail information "on their
> MIB." Basically, you have one string in userspace and you can massage
> the hell out of it and even fit it to the MIB or whatever...

The rationale behind providing binary information instead of a printk is
to avoid kernel to spend time with printk formatting and userspace to
parse it and try to recover the original data.

If this weren't a requirement, the better would be to just not use any
tracepoint for errors at all.

Also, the API should be handled in a way that it will work on userspace.

I'm foreseen the need of implementing the error counters on userspace
on my plans for the edac-utils userspace (or at least, some advanced
error counter logic, like burst detection, decay, etc).

For that to be properly implemented, userspace may need to get access
to each of the components of the "detail" field.

>
> [ â ]
>
>>>>>>> mcegen.py-2868 [007] .N.. 178.261607: mc_event: Corrected error:amd64_edac on memory stick "unknown memory" (mc:0 csrow:3 channel:1 page:0x5bac7 offset:0x388 grain:0 syndrome:0x34ed )
>>>>
>>>> CE/UE is there. The only change is that, instead of using acronyms, it is now
>>>> saying "Corrected error"/"Uncorrected error", as the idea is to provide something
>>>> that the user will better understand.
>>>
>>> Ok, so let's change the string output from the above version to:
>>>
>>> "mcegen.py-2868 [007] .N.. 178.261607: mc_event: amd64_edac: Corrected error on memory stick "unknown memory" (mc:0 csrow:3 channel:1 page:0x5bac7 offset:0x388 grain:0 syndrome:0x34ed)"
>>>
>>> I.e., "mc_event" [DRIVER] [ERROR TYPE] [DETAIL].
>>
>> As I've explained, there is a consistency issue with the [ERROR MSG]. On the ones
>> that properly implement it, [ERROR MSG] is "read error", "spare copy", "bus error", ...
>>
>> If you think we need to add an extra field for the driver name, then let's add a
>> "driver_name" field there, but a driver's name shouldn't be merged with an
>> error type message, as it is abusing the syntax, and the syntax of each field
>> should be clear enough to allow it to be stored on a MIB.
>>
>> In any case, those drivers that fill the error msg with something else should
>> be changes to write there something like "ECC error" instead of "amd64_edac:".
>>
>> Btw, I'm almost convinced that we should break the "detail" into its components,
>> e. g., instead of one string, it should be:
>> int layer0, layer1, layer2; /* Location */
>> unsigned long memory_address; /* or maybe page/offset */
>> u32 grain;
>> unsigned syndrome;
>>
>> That would be better from MIB point of view.
>
> Dude, enough with this MIB crap already!

Your feedback about the patches are very welcome, but it seems that you
have no experience with MIB handling and/or with userspace logic for
error management, so, please stop nonsense about something that doesn't
seem to be on your expertise area.

> Simply move the EDAC_MOD_STR thing earlier in the tracepoint so that you
> can get:
>
> "mc_event:" EDAC_MOD_STR [ERROR_TYPE] [DETAIL]
>
> EDAC_MOD_STR is _always_ the driver name:
>
> 0 amd64_edac.h 148 #define EDAC_MOD_STR "amd64_edac"
> 1 amd76x_edac.c 23 #define EDAC_MOD_STR "amd76x_edac"

> 2 amd8111_edac.c 37 #define AMD8111_EDAC_MOD_STR "amd8111_edac"
> 3 amd8131_edac.c 37 #define AMD8131_EDAC_MOD_STR "amd8131_edac"
> 4 cpc925_edac.c 34 #define CPC925_EDAC_MOD_STR "cpc925_edac"

Those above use another name.

> 5 e752x_edac.c 28 #define EDAC_MOD_STR "e752x_edac"
> 6 e7xxx_edac.c 33 #define EDAC_MOD_STR "e7xxx_edac"
> 7 i3000_edac.c 21 #define EDAC_MOD_STR "i3000_edac"
> 8 i3200_edac.c 22 #define EDAC_MOD_STR "i3200_edac"
> 9 i5000_edac.c 31 #define EDAC_MOD_STR "i5000_edac"
> a i5400_edac.c 38 #define EDAC_MOD_STR "i5400_edac"
> b i7300_edac.c 36 #define EDAC_MOD_STR "i7300_edac"
> c i7core_edac.c 65 #define EDAC_MOD_STR "i7core_edac"
> d i82443bxgx_edac.c 36 #define EDAC_MOD_STR "i82443bxgx_edac"
> e i82860_edac.c 20 #define EDAC_MOD_STR "i82860_edac"
> f i82875p_edac.c 24 #define EDAC_MOD_STR "i82875p_edac"
> g i82975x_edac.c 20 #define EDAC_MOD_STR "i82975x_edac"
> h mpc85xx_edac.h 15 #define EDAC_MOD_STR "MPC85xx_edac"
> i mv64x60_edac.h 16 #define EDAC_MOD_STR "MV64x60_edac"
> j r82600_edac.c 26 #define EDAC_MOD_STR "r82600_edac"
> k sb_edac.c 38 #define EDAC_MOD_STR "sbridge_edac"
> l x38_edac.c 21 #define EDAC_MOD_STR "x38_edac"
>

Ok, I'll add a driver_name field with *EDAC_MOD_STR, on a separate
patchset (as this will require touching, again, on every single
driver).

Regards,
Mauro
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/