Re: [tip:x86/mce] x86, mce: Rename cpu_specific_poll to mce_cpu_specific_poll

From: Hidetoshi Seto
Date: Tue Jan 26 2010 - 04:07:28 EST


(2010/01/26 15:33), Borislav Petkov wrote:
> In the end, even if the info were correct, it is still not nearly enough
> for all the information you might need from a system. So you end up
> pulling a dozen different tools just to get the info you need. So
> yes, I really do think we need a tool to get the job done right and
> on any system. And this tool should be distributed with the kernel
> sources like perf is, so that you don't have to jump through hoops to
> pull the stuff (esp. if you have to build everything every time like
> Andreas does :)).

How about having a system file that is maintained along with the kernel,
e.g. /proc/hwinfo or /sys/devices/platform/hwinfo, or a directory
with some files like /somewhere/hwinfo/{dmi,acpi,pci,...}?

>> And since it's kernel
>> based it cannot do most of the interesting reactions. And it doesn't
>> have a usable interface to add user events.
>>
>> And yes having all that crap in syslog is completely useless, unless
>> you're debugging code.
>
> So basically, IMHO we need:
>
> 1. Resilient error reporting that reliably pushes decoded error info to
> userspace and/or network. That one might be tricky to do but we'll get
> there.

I think it would be better to treat an "error" as a subset of "event",
which can be reported to whoever is interested and filtered out otherwise.
The use of TRACE_EVENT() for the mce event at least aims at such an approach.
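
The "error is a subset of event" idea can be illustrated with a minimal
user-space sketch. This is not the kernel's TRACE_EVENT() machinery; the
Event class, its fields, and make_filter() are hypothetical names chosen
only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Event:
    """A generic event; an error is just an event with is_error set."""
    kind: str       # illustrative kinds, e.g. "mce" or "thermal"
    is_error: bool
    message: str


def make_filter(interested_kinds, errors_only=False):
    """Build a predicate that accepts only the events a consumer asked for."""
    def accept(event):
        if errors_only and not event.is_error:
            return False
        return event.kind in interested_kinds
    return accept


events = [
    Event("mce", True, "corrected memory error"),
    Event("mce", False, "bank polled, nothing found"),
    Event("thermal", False, "throttling ended"),
]

# A consumer interested only in mce errors sees one event; the rest are filtered.
mce_errors = [e for e in events if make_filter({"mce"}, errors_only=True)(e)]
```

A consumer that wants every mce event, error or not, would simply pass
errors_only=False.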

> 2. Error severity grading and acting upon each type accordingly. This
> might need to be vendor-specific.

I think you mean severity grading in the kernel.
Even if the hardware reports an error and grades it as corrected, the kernel
can escalate the severity, likely based on some threshold.
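
A minimal sketch of such threshold-based escalation, assuming a sliding
time window per error source. The class name, the severity labels, and the
threshold and window values are all hypothetical; real kernel code would
look quite different.

```python
from collections import deque

# Hypothetical severity labels, for illustration only.
CORRECTED = "corrected"
ESCALATED = "escalated"


class EscalationTracker:
    """Escalate corrected errors once too many occur within a time window."""

    def __init__(self, threshold, window_seconds):
        self.threshold = threshold
        self.window = window_seconds
        self.history = {}  # error source -> deque of timestamps

    def record(self, source, timestamp):
        """Record one corrected error and return the resulting severity."""
        times = self.history.setdefault(source, deque())
        times.append(timestamp)
        # Drop errors that have fallen out of the sliding window.
        while times and timestamp - times[0] > self.window:
            times.popleft()
        return ESCALATED if len(times) >= self.threshold else CORRECTED


tracker = EscalationTracker(threshold=3, window_seconds=60)
severities = [tracker.record("dimm0", t) for t in (0, 10, 20)]
```

Here the third error from the same source inside 60 seconds is escalated,
while errors from other sources, or errors spread far apart in time, stay
graded as corrected.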

> 3. Proper error format suiting all types of errors.

As mentioned in Andi's PDF, the CPER format is one of the good candidates
available today, I think.
However, we could invent a more suitable one if needed.

> 4. Vendor-specific hooks where it is needed for in-kernel handling of
> certain errors (L3 cache index disable, for example).

There would be some difficulty in adding such a hook to the UE handling path,
but we can at least have it for the CE path. It just needs to be organized.

> 5. Error thresholding, representation, etc all done in userspace (maybe
> even on a different machine).

(...BTW, how about putting the mcelog tree under /tools, Andi?)

> 6. Last but not least, and maybe this is wishful thinking, a good tool
> to dump hwinfo from the kernel. We do a great job of detecting that info
> already - we should do something with it, at least report it...

Of course I want to have a tool to get a summary (not a full dump) of
the current hardware status too, e.g.:
$ cat ./hwinfo/faulty
WARN: DIMM @ slot X on node Y: 208 errors corrected in last 3 days
INFO: PCI 0000:NN:01.1: 1 error recovered 37 hours ago
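
A minimal sketch of how a user-space tool might produce summary lines of
this kind. The summary_line() function, its parameters, and the warning
threshold of 100 are assumptions made for illustration; the wording differs
slightly from the example above because it always reports a window in hours.

```python
def summary_line(device, count, action, hours, warn_threshold=100):
    """Format one summary line; the level depends on a hypothetical threshold."""
    level = "WARN" if count >= warn_threshold else "INFO"
    noun = "error" if count == 1 else "errors"
    return f"{level}: {device}: {count} {noun} {action} in last {hours} hours"


lines = [
    summary_line("DIMM @ slot X on node Y", 208, "corrected", 72),
    summary_line("PCI 0000:NN:01.1", 1, "recovered", 37),
]
```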

> Let's see what the others think.
>
> Thanks.

Thanks,
H.Seto
