Re: [PATCH] amd64_edac: Build module on x86-32

From: Tomasz Pala
Date: Sun Nov 02 2014 - 09:08:55 EST


On Sun, Nov 02, 2014 at 13:35:38 +0100, Borislav Petkov wrote:

> Or do you want for amd64_edac to try to pinpoint which DIMMs are causing
> the errors too?

Yes - when an error happens, it would be desirable to locate the failing
module.

> So were you able to confirm that those errors went away after replacing
> the DIMMs?

Can't say - I have noticed such an error only once; how much silent bit
rot I've missed is hard to say, as I had no data checksums before.
The previous modules were well tested in this motherboard, so I can't
blame them or any other component - it's a 'cosmic ray' situation.

OK, with EDAC_DECODE_MCE I would know whether I should blame RAM or not.
But if the UCE rate is 1/year, I can't randomly remove modules and wait
to see whether the problem is gone. Any single UCE should result in an
action that narrows down the possible causes. Other than 'replace the
entire RAM', obviously.

> First of all, you need to relax yourself. Just calm down a bit, maybe
> take a walk first. Take a deep breath, whatever helps.

OK, done. Sorry for being rude.

> I'm not talking about your time, energy and resources but about mine! I
> don't have 32-bit configurations to test 32-bit amd64_edac and am not
> willing to go buy any. So let me flip your question: are you going to
> test amd64_edac on 32-bit and fix issues when people report them?

1. Yes, I'm going to test, but no, I'm not capable of fixing it, sorry.
1a. You said there were other reporters; maybe some of them are capable.
2. Have there been any op-mode specific issues in this code so far? Does
this differ from not having e.g. F10h hardware? If such an issue shows
up, I can grant remote access to the machine, but that's unfortunately
all.
3. I didn't know that a lack of resources to support discrepancies that
might occur (but are not occurring right now) is a valid reason for
disabling a module entirely. After all, there are many parts that are not
actively maintained at all and nobody removes them preemptively. Back
then it could be (X86_64 || EXPERIMENTAL); couldn't it now be just a note
in the description?
4. To be honest, I think that more people are abandoning x86-32 than
enabling ECC on it, so I wouldn't worry about people starting to use this
and reporting 32-bit related errors. If you have got reports on this once
every few months, that's the order of magnitude we are talking about.

So, if 32-bit related errors are a real threat, not just an excuse,
ENOTIME for handling them is fair enough - people determined to have this
running, like me, will find their way. But please don't say it's not
_worth_ it; Kconfig descriptions are not a place to make such judgements
(as it's YOUR time vs MY data). I'd go for something more objective,
like "this driver might be run on a 32-bit kernel, however no complaints
will be accepted due to lack of resources to handle 32-bit specific
bugs".

Oh, and one more thing about the proposed description - I noticed this
earlier:

[PATCH 01/16] amd64_edac: Remove F11h support
Fri, 26 Nov 2010 20:04:08 +0100

F11h doesn't support DRAM ECC so whack it away.

and I see only the F10h, F15h and F16h families mentioned in amd64_edac.c.

regards,
--
Tomasz Pala <gotar@xxxxxxxxxxxxx>