Re: [PATCH V8 06/10] acpi: apei: panic OS with fatal error status block

From: James Morse
Date: Thu Feb 09 2017 - 05:58:55 EST


Hi Jonathan, Tyler,

On 01/02/17 17:16, Tyler Baicar wrote:
> From: "Jonathan (Zhixiong) Zhang" <zjzhang@xxxxxxxxxxxxxx>
>
> Even if an error status block's severity is fatal, the kernel does not
> honor the severity level and panic.
>
> With the firmware first model, the platform could inform the OS about a
> fatal hardware error through the non-NMI GHES notification type. The OS
> should panic when a hardware error record is received with this
> severity.
>
> Call panic() after CPER data in error status block is printed if
> severity is fatal, before each error section is handled.

> diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
> index 8756172..86c1f15 100644
> --- a/drivers/acpi/apei/ghes.c
> +++ b/drivers/acpi/apei/ghes.c
> @@ -687,6 +689,13 @@ static int ghes_ack_error(struct acpi_hest_generic_v2 *generic_v2)
> return rc;
> }
>
> +static void __ghes_call_panic(void)
> +{
> + if (panic_timeout == 0)
> + panic_timeout = ghes_panic_timeout;
> + panic("Fatal hardware error!");
> +}
> +

__ghes_panic() also has:
> __ghes_print_estatus(KERN_EMERG, ghes->generic, ghes->estatus);

Which prints this estatus regardless of rate limiting and cache-ing.

[ ... ]

> @@ -698,6 +707,10 @@ static int ghes_proc(struct ghes *ghes)
> if (ghes_print_estatus(NULL, ghes->generic, ghes->estatus))

ghes_print_estatus() uses some custom rate limiting '2 messages every 5
seconds', GHES_SEV_PANIC shares the same limit as GHES_SEV_RECOVERABLE.

I think its possible to get 2 recoverable messages, then a panic in a 5 second
window. The rate limit will kick in to stop the panic estatus block being
printed, but we still go on to call panic() without the real reason being printed...

(the caching thing only seems to consider identical messages, given we would
never see two panic messages, I don't think that will cause any problems.)

> ghes_estatus_cache_add(ghes->generic, ghes->estatus);
> }
> + if (ghes_severity(ghes->estatus->error_severity) >= GHES_SEV_PANIC) {
> + __ghes_call_panic();
> + }
> +

I think this ghes_severity() then panic() should go above the:
> if (!ghes_estatus_cached(ghes->estatus)) {
and we should call __ghes_print_estatus() here too, to make sure the message
definitely got out!


With that,
Reviewed-by: James Morse <james.morse@xxxxxxx>


Thanks,

James