Re: [PATCH v2 0/5] Handle corrected machine check interrupt storms

From: Yazen Ghannam
Date: Fri Mar 17 2023 - 10:50:16 EST


On Mon, Jun 27, 2022 at 10:36:00AM -0700, Tony Luck wrote:
> Extend the logic of handling Intel's corrected machine check interrupt
> storms to AMD's threshold interrupts.
>
> The first two patches are from Tony; they clean up the existing storm
> handling for Intel and propose per-CPU, per-bank storm handling.
>
> The third and fourth patches clean up and refactor the CMCI storm
> handling so that a similar workaround can be extended to AMD's threshold
> interrupt storms. These two patches could be merged into Tony's second
> patch, the CMCI storm mitigation.
>
> AMD's storm mitigation for threshold interrupts also relies on a per-CPU,
> per-bank approach similar to Intel's. Unlike CMCI storm handling, however,
> it does not set thresholds to reduce the rate of interrupts during a storm.
> Instead, it turns off the interrupt on the current CPU and bank while a
> storm is in progress and re-enables it when the storm subsides.
>
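
For anyone following along, the difference described above can be modeled
roughly as follows. This is a user-space sketch, not code from the patches:
the names (handle_storm, irq_state, struct bank_irq), the array sizes, and
the two threshold values are all assumptions made up for illustration, and
the real code programs hardware registers rather than a table.

```c
#include <stdbool.h>

#define NR_CPUS  4   /* illustrative sizes, not kernel values */
#define NR_BANKS 8

#define NORMAL_THRESHOLD 1u     /* assumed value, for illustration only */
#define STORM_THRESHOLD  1000u  /* assumed value, for illustration only */

enum vendor { VENDOR_INTEL, VENDOR_AMD };

/* Hypothetical model of the corrected-error interrupt for one CPU/bank. */
struct bank_irq {
    bool enabled;            /* interrupt is delivered at all */
    unsigned int threshold;  /* errors needed before the interrupt fires */
};

static struct bank_irq irq_state[NR_CPUS][NR_BANKS];

/* Start with every interrupt enabled at the normal threshold. */
static void init_irq_state(void)
{
    for (unsigned int cpu = 0; cpu < NR_CPUS; cpu++) {
        for (unsigned int bank = 0; bank < NR_BANKS; bank++) {
            irq_state[cpu][bank].enabled = true;
            irq_state[cpu][bank].threshold = NORMAL_THRESHOLD;
        }
    }
}

/* Called when a storm begins (storm_on) or ends (!storm_on) on cpu/bank. */
static void handle_storm(enum vendor v, unsigned int cpu, unsigned int bank,
                         bool storm_on)
{
    struct bank_irq *b = &irq_state[cpu][bank];

    switch (v) {
    case VENDOR_INTEL:
        /* Keep the interrupt, but raise the threshold so it fires less. */
        b->threshold = storm_on ? STORM_THRESHOLD : NORMAL_THRESHOLD;
        break;
    case VENDOR_AMD:
        /* Turn the interrupt off for the storm, back on afterwards. */
        b->enabled = !storm_on;
        break;
    }
}
```

Only the one CPU/bank pair passed in is touched, so a storm on one bank
leaves interrupt delivery for every other bank and CPU unchanged.
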
> It is okay to turn off threshold interrupts on AMD systems as other error
> severities continue to be handled even if the threshold interrupts are
> turned off. Uncorrected errors will generate a #MC, and deferred errors
> have their own separate deferred error interrupt. The final patch adds
> support for handling threshold interrupt storms on AMD systems.
>
> Changes since v1:
>
> 1) Fix the shift computation used when keeping track of bank history.
> The shift should be "1" when a storm is in progress (because the bank is
> polled once per second). When a storm is not in progress, the shift should
> be based on the number of seconds since the bank was last checked.
>
> 2) Changed Smita's code in part 0003 to avoid use of a function pointer
> (since the kernel is avoiding indirect branch points that might be
> trainable for various Spectre-like issues).
>
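
To make change 1) concrete, here is a rough user-space sketch of the shift
rule as I read it. It is not the patch code: struct bank_history,
history_shift(), record_check(), and the 64-bit history width are all
hypothetical, and clamping the shift to at least 1 when less than a second
has elapsed is my own assumption.

```c
#include <stdbool.h>
#include <stdint.h>

#define HISTORY_BITS 64  /* width of the history word (assumption) */

/* Hypothetical per-CPU, per-bank tracking state. */
struct bank_history {
    uint64_t history;     /* bit 0 = latest check; 1 = corrected error seen */
    uint64_t last_check;  /* time of the previous check, in seconds */
    bool in_storm;        /* true while the bank is polled once per second */
};

/* How far to shift the history before recording the current check. */
static unsigned int history_shift(const struct bank_history *b, uint64_t now)
{
    uint64_t elapsed;

    /* During a storm the bank is polled every second: one bit per poll. */
    if (b->in_storm)
        return 1;

    /* Otherwise age the history by the seconds since the last check. */
    elapsed = now - b->last_check;
    if (elapsed == 0)
        return 1;  /* assumption: every check gets its own bit */
    if (elapsed > HISTORY_BITS)
        return HISTORY_BITS;
    return (unsigned int)elapsed;
}

/* Record the result of checking the bank at time 'now'. */
static void record_check(struct bank_history *b, uint64_t now, bool error_seen)
{
    unsigned int shift = history_shift(b, now);

    /* Shifting a 64-bit value by 64 is undefined in C, so clear instead. */
    b->history = (shift < HISTORY_BITS) ? (b->history << shift) : 0;
    if (error_seen)
        b->history |= 1;
    b->last_check = now;
}
```

With this rule a long quiet gap outside a storm ages out old errors, while
during a storm each one-second poll consumes exactly one history bit.
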
> Smita Koralahalli (3):
>   x86/mce: Introduce mce_handle_storm() to deal with begin/end of storms
>   x86/mce: Move storm handling to core.
>   x86/mce: Handle AMD threshold interrupt storms
>
> Tony Luck (2):
>   x86/mce: Remove old CMCI storm mitigation code
>   x86/mce: Add per-bank CMCI storm mitigation
>
> arch/x86/kernel/cpu/mce/amd.c | 49 ++++++++
> arch/x86/kernel/cpu/mce/core.c | 139 +++++++++++++++++-----
> arch/x86/kernel/cpu/mce/intel.c | 179 +++++++----------------------
> arch/x86/kernel/cpu/mce/internal.h | 33 ++++--
> 4 files changed, 230 insertions(+), 170 deletions(-)
>
> --

Hi Tony,

Is there an updated version of this set? I can help review and test. Smita is
focusing on other items at the moment.

Thanks!

-Yazen