RE: [PATCH v.23-2] RAS: use tracepoint to handle hw issues

From: Luck, Tony
Date: Fri May 11 2012 - 13:02:42 EST


> For example:
>
> mc_event: Corrected error:memory read on memory stick "DIMM_1A" (mc:0 channel:0 slot:0 page:0x586b6e offset:0xa66 grain:32 syndrome:0x0 area:DMA)

This is looking so much better.

I looked through your examples from drivers on what text we might see
in the "memory read" position ... and agree that it would be a lot of
work to make them all come up with grammatically clean messages, especially
for all the poorly documented (or undocumented) "default/unknown/..." cases.

Back to my "does the casual user really need to know" soapbox. What different
actions do we expect a user to take when we tell them "read error" or "write
error" or "unknown error"? I'm beginning to think that this belongs inside
the brackets! Perhaps as: type:"memory read"?

Then we'd have:

mc_event: Corrected error on memory stick "DIMM_1A" (bunch of stuff for deep diagnosis by vendor)
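
A minimal sketch of that layout in C, with the error type moved inside the brackets. The struct, its field names, and format_mc_event() are hypothetical and only mirror the keys in the example line above; this is not the tracepoint code from the patch.

```c
#include <stdio.h>

/*
 * Hypothetical record for illustration only. The field names mirror the
 * keys in the example line, not the fields of the real tracepoint.
 */
struct mc_event_sketch {
	int corrected;          /* 1 = Corrected, 0 = Uncorrected */
	const char *label;      /* memory stick label, e.g. "DIMM_1A" */
	const char *type;       /* e.g. "memory read", now inside the brackets */
	int mc, channel, slot;
	unsigned long page, offset;
};

/*
 * Severity and stick label come first because they are what the user acts
 * on; everything else goes inside the parentheses for deep diagnosis.
 */
static int format_mc_event(char *buf, size_t len,
			   const struct mc_event_sketch *e)
{
	return snprintf(buf, len,
		"mc_event: %s error on memory stick \"%s\" "
		"(type:\"%s\" mc:%d channel:%d slot:%d page:0x%lx offset:0x%lx)",
		e->corrected ? "Corrected" : "Uncorrected",
		e->label, e->type, e->mc, e->channel, e->slot,
		e->page, e->offset);
}
```

The grain, syndrome and area keys from the original example are left out to keep the sketch short; they would follow the same key:value pattern inside the parentheses.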

Knowing that the error was Corrected/Uncorrected is vital to the user. It lets them know
the urgency with which they need to take action ... we need to educate them that a few
"Corrected" errors are perfectly normal and nothing to raise blood pressure about.

Knowing which memory stick was involved - also very important. If they do take action,
they should be able to swap out the memory stick that was the source of the problem.

Everything else is just for memory geeks like me, you and Boris (and OEMs who want to
diagnose root cause of problems they see by pattern analysis across errors from multiple
machines with DIMMs from different batches/vendors).

-Tony