Re: [PATCH] ACPI/APEI: Clear GHES block_status before panic()

From: James Morse
Date: Fri Dec 21 2018 - 13:52:27 EST


On 21/12/2018 11:17, Rafael J. Wysocki wrote:
> On Thursday, December 20, 2018 8:24:47 PM CET Borislav Petkov wrote:
>> + James.

Thanks,

>> On Wed, Dec 19, 2018 at 11:50:52AM -0500, David Arcari wrote:
>>> From: Lenny Szubowicz <lszubowi@xxxxxxxxxx>
>>>
>>> In __ghes_panic() clear the block status in the APEI generic
>>> error status block for that generic hardware error source before
>>> calling panic() to prevent a second panic() in the crash kernel
>>> for exactly the same fatal error.
>>>
>>> Otherwise ghes_probe(), running in the crash kernel, would see
>>> an unhandled error in the APEI generic error status block and
>>> panic again, thereby precluding any crash dump.

I bet that was fun to watch!


>>> diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
>>> index 02c6fd9..f008ba7 100644
>>> --- a/drivers/acpi/apei/ghes.c
>>> +++ b/drivers/acpi/apei/ghes.c
>>> @@ -691,6 +691,8 @@ static void __ghes_panic(struct ghes *ghes)
>>> {
>>> __ghes_print_estatus(KERN_EMERG, ghes->generic, ghes->estatus);
>>>
>>> + ghes_clear_estatus(ghes);
>>> +
>>> /* reboot to log the error! */
>>> if (!panic_timeout)
>>> panic_timeout = ghes_panic_timeout;
>>
>> Acked-by: Borislav Petkov <bp@xxxxxxx>
>
> Patch applied, thanks!

Great!

Do we need to ghes_ack_error() too?

With the location cleared the new kernel will never find the records, and
firmware can never re-use that location because it wasn't ack'd. The upshot is
RAS records can't be generated for the kdump kernel. The acpi spec talks about
use of the memory, so I don't think its fair for it to use this to disarm a
watchdog.

I think we can live with this as the kdump kernel isn't going to handle RAS
errors for the bulk of memory anyway.


Thanks,

James