Re: [PATCH] core_pattern: add CPU specifier

From: Renaud Métrich
Date: Thu Sep 08 2022 - 02:45:57 EST


Hello,

I have been working closely with Oleksandr on a couple of cases where customers saw segfaults in various processes, including basic tools ("grep", "cut", etc.) that usually don't crash.

The coredumps of course showed nothing useful: from userland's perspective nothing was wrong apart from a bad pointer that couldn't be explained.

Memory testing (e.g. Memtest86+) and CPU testing (usually with the hardware vendor's tools) never showed any hardware issue either, even though there was one, probably because triggering it required special conditions such as a specific load and/or thermal state.

Troubleshooting such cases takes several weeks or even months before we have enough evidence that the OS is not at fault, and it is always a struggle.

Usually, when we start getting kernel crashes, we are actually happy, because a kernel crash report indicates the CPU the task was running on, and that information has so far always been reliable enough to point to a faulty CPU. The other cases, where no kernel crash could be observed, are solved only after asking the customer to replace hardware components, which is difficult to justify since it usually costs the customer money and time.

I hope such a feature will be helpful for everybody doing Linux support.

Renaud.

On 9/7/22 at 17:53, Luis Chamberlain wrote:
> On Sat, Sep 03, 2022 at 08:43:30AM +0200, Oleksandr Natalenko wrote:
>> Statistically, in a large deployment regular segfaults may indicate a CPU issue.
> Can you elaborate on this? How common is this observed to be true? Are
> there any public findings or bugs where it showed this?
>
> Luis
