Re: [PATCH] x86: mce: Xeon75xx specific interface to get corrected memory error information v2

From: Hidetoshi Seto
Date: Mon Mar 29 2010 - 06:47:53 EST


(2010/03/29 18:01), Andi Kleen wrote:
>>>> Xeon 75xx doesn't log physical addresses on corrected machine check
>>>> events in the standard architectural MSRs. Instead the address has to
>>>> be retrieved in a model specific way. This makes it impossible
>>>> to do predictive failure analysis.
>>
>> Could you point proper specification or datasheet to know/check what
>> you are going to do here?
>
> You mean how the model specific interface works?
>
> There's currently no public specification for the interface,
> but it should be reasonably clear from reading the driver how
> it works.
>
> -Andi

It looks overengineered...

I have some questions: Is it impossible to get the address
after the polling handler has run? E.g., could this module be
implemented as an mcelog add-on that is hooked in and invoked
immediately after reading /dev/mcelog? I guess there are
some limitations/restrictions on calling pfa_command().

Is there any alternative way to get the address?
Wouldn't polling, as edac_i7 does, help here?

You pointed out that "This makes it impossible to do predictive
failure analysis", but I guess we could still do a rough-but-sufficient
analysis that only needs coarse resolution, such as per socket. Or
should we not expect that one of the DIMMs connected to a socket is
broken if that socket reports corrected memory errors many times?
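
To illustrate what I mean by socket-level analysis, here is a minimal
sketch in C. It only counts corrected errors per socket within a time
window and flags a socket once a threshold is reached. The names
(ce_tracker, ce_record), the threshold and the window length are all
hypothetical, not taken from the patch or from mcelog, and the socket
number is assumed to be available from the logged record.

```c
#include <stdbool.h>
#include <stdint.h>

#define CE_MAX_SOCKETS    8               /* hypothetical upper bound */
#define CE_THRESHOLD      10              /* hypothetical: errors per window */
#define CE_WINDOW_SECONDS (24 * 60 * 60)  /* hypothetical: one day */

/* Per-socket corrected-error count within the current window. */
struct ce_socket_state {
	uint64_t window_start;   /* time of first error in window, seconds */
	unsigned int count;      /* corrected errors seen in this window */
};

struct ce_tracker {
	struct ce_socket_state socket[CE_MAX_SOCKETS];
};

/*
 * Record one corrected memory error reported by `socket` at time `now`
 * (seconds). Returns true once the socket has reached CE_THRESHOLD
 * errors within CE_WINDOW_SECONDS, i.e. when a DIMM behind that socket
 * looks suspicious. Out-of-range socket numbers are ignored.
 */
bool ce_record(struct ce_tracker *t, unsigned int socket, uint64_t now)
{
	struct ce_socket_state *s;

	if (socket >= CE_MAX_SOCKETS)
		return false;

	s = &t->socket[socket];

	/* Start a new window on the first error or after the old one expired. */
	if (s->count == 0 || now - s->window_start >= CE_WINDOW_SECONDS) {
		s->window_start = now;
		s->count = 0;
	}

	s->count++;
	return s->count >= CE_THRESHOLD;
}
```

This cannot tell which DIMM behind the socket is failing, of course;
it only narrows the suspect down to one socket.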


Thanks,
H.Seto


