Re: [PATCH v2 2/2] mce: acpi/apei: Add a boot option to disable ffmode for corrected errors

From: Naveen N. Rao
Date: Thu Jun 20 2013 - 17:21:40 EST


On 06/20/2013 02:58 AM, Luck, Tony wrote:
>> Ok, where is that semantics? What in a CPER record does say "this error
>> should tell you that you need to offline the containing page and I'm
>> telling you this exactly only once"? Error Severity 0, i.e. Recoverable?
>
> Naveen - this one is for you (or for your BIOS team). Can you get us a sample
> CPER that you plan to provide when the BIOS decides that its threshold has
> been exceeded? How will it be different from what old WSM-EX platforms
> were sending to us? Hopefully the answer is encoded in the CPER record
> and not in some code we have to put in Linux to say
> "if (IBMplatform) do_thing_1(); else ... "

Looking at the specs, there might be a few ways we can do this:
- One, an Error Threshold Value of 1 in the Hardware Error Notification
  structure for CMC. The spec describes this field as the number of error
  events that must occur before the OS treats them as an error condition.
  With a threshold value of 1, we are essentially asking the OS not to do
  any further thresholding of its own.
- Two, the Generic Error Data Entry (aka UEFI Section Descriptor) has a
  flag which indicates 'Error Threshold Exceeded'. From the UEFI spec, it
  looks like we could treat this flag as an indication to offline the
  page, though I am not sure if/how it relates to the threshold value
  above.
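
To make the two options concrete, here is a rough userspace sketch of what each check could look like. The struct and function names are made up for illustration and are only trimmed-down stand-ins for the ACPI/UEFI structures, not the kernel's definitions. The 0x08 value assumes 'Error Threshold Exceeded' is bit 3 of the section descriptor flags, which is my reading of the UEFI spec. Combining the two checks with an OR is purely an assumption, since how they relate is exactly the open question.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the Hardware Error Notification structure;
 * only the two threshold fields are modelled here. */
struct hw_error_notify {
	uint32_t error_threshold_value;
	uint32_t error_threshold_window;
};

/* Hypothetical stand-in for a Generic Error Data Entry;
 * only severity and flags are modelled here. */
struct generic_error_data {
	uint32_t error_severity;
	uint8_t flags;
};

/* Assumed: 'Error Threshold Exceeded' is bit 3 of the section flags. */
#define SEC_ERROR_THRESHOLD_EXCEEDED 0x08

/* Option one: a threshold value of 1 means firmware is asking the OS
 * to act on every event it reports, with no further thresholding. */
static bool fw_did_thresholding(const struct hw_error_notify *n)
{
	return n->error_threshold_value == 1;
}

/* Option two: the section itself says the threshold was exceeded. */
static bool section_threshold_exceeded(const struct generic_error_data *g)
{
	return (g->flags & SEC_ERROR_THRESHOLD_EXCEEDED) != 0;
}

/* Illustrative policy only: offline the page if either signal is set. */
static bool should_offline_page(const struct hw_error_notify *n,
				const struct generic_error_data *g)
{
	return fw_did_thresholding(n) || section_threshold_exceeded(g);
}
```

Either check keeps the decision in data supplied by firmware, so no platform-specific branch would be needed in Linux.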

Thoughts?


Thanks,
Naveen
