Re: [patch 01/10] x86/fpu/signal: Clarify exception handling in restore_fpregs_from_user()

From: Luck, Tony
Date: Tue Aug 31 2021 - 14:39:41 EST


On Tue, Aug 31, 2021 at 09:39:30AM +0200, Borislav Petkov wrote:
> On Tue, Aug 31, 2021 at 02:34:16AM +0200, Thomas Gleixner wrote:
> No no, the great way to do error injection is the ACPI-spec'ed, firmware
> implemented
>
> drivers/acpi/apei/einj.c
>
> Yap, you heard me right, firmware. And when you hear firmware, you can
> imagine how it all works in practice... Yeap, exactly.

You can imagine all you want. And if your imagination is based
on experiences with very old systems like Haswell (launched in 2015)
then you'd be right to be skeptical of firmware capabilities.

> We even wrote documentation what to do:
>
> Documentation/firmware-guide/acpi/apei/einj.rst
>
> But but, this is firmware so
>
> - it is f*cking broken in all ways imaginable

s/is/was/

>
> - if it works, it doesn't support the error type which you wanna inject

Memory errors now have very good coverage. Still some issues with PCIe injection.

> - if it does, enterprise sh*t hw has added value crap which analyzes and
> looks at hardware errors first </me rolls eyes, trying to remain serious>
> so you might get the error report if you get lucky.

Turn off eMCA in BIOS to avoid this.

> > The HW injection mechanisms definitely exist, but without documentation
> > they are useless. Intel still thinks that the secrecy around that stuff
> > is valuable and they can get away with those untestable mechanisms even
> > for their endeavours in the safety critical space.

The injection controls in the memory controller can only be accessed
from SMM. There is some paranoia that a ring0 attack could inject
errors at random intervals, causing major costs to diagnose and replace
"failing" DIMMs. So documentation wouldn't help Linux because it just
can't twiddle the necessary bits in the h/w.

> My impression with error injection with hw people is just like what they
> do with perf counters: it counts *something* right? You should be happy
> that it does.

This was true <= Haswell. But definitely not true now. The h/w groups
now have validation teams that depend on ACPI/EINJ for many of their
system level tests. Those guys are serious about this stuff. While I'll
just inject 1000 errors on a single machine and call it good if it all
goes as expected, those folks have (small) clusters running injection
tests 24x7 for weeks at a time.
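
For anyone who hasn't tried it, a single injection through the debugfs
interface described in Documentation/firmware-guide/acpi/apei/einj.rst
looks roughly like the sketch below. The file names (error_type, param1,
param2, error_inject) and the 0x8 "memory correctable" type come from that
document; the helper names, the example address and the "base" parameter
are hypothetical and only there for illustration. It assumes debugfs is
mounted, EINJ is enabled in BIOS, and you are root.

```python
import os

# Default debugfs location of the EINJ interface (assumes debugfs is
# mounted at /sys/kernel/debug and the einj module is loaded).
EINJ_DIR = "/sys/kernel/debug/apei/einj"

# Error type value for "Memory Correctable" as listed in einj.rst.
MEM_CORRECTABLE = 0x8

# Page-granular address mask; illustrative, matches a 4K page.
PAGE_MASK = 0xFFFFFFFFFFFFF000


def write_einj(base, name, value):
    """Write one hex value to an EINJ control file (hypothetical helper)."""
    with open(os.path.join(base, name), "w") as f:
        f.write(f"{value:#x}\n")


def inject_memory_error(phys_addr, base=EINJ_DIR,
                        error_type=MEM_CORRECTABLE, mask=PAGE_MASK):
    """Request one memory error injection at phys_addr.

    Order matters: pick the type and address first, then trigger.
    """
    write_einj(base, "error_type", error_type)
    write_einj(base, "param1", phys_addr)   # physical address to target
    write_einj(base, "param2", mask)        # address mask
    write_einj(base, "error_inject", 1)     # any write kicks off injection
```

Something like inject_memory_error(0x12345000) in a loop is the
"inject 1000 errors and see if it all goes as expected" kind of test.
Check available_error_type first, since the platform may not offer the
type you want.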

Downsides of ACPI/EINJ today:
1) Availability on production machines. It is always disabled by default
in BIOS. OEMs may not provide a setup option to turn it on (or may have
deleted the code to support it completely). Intel's pre-production servers
always have the code, and the setup option to enable.
2) Doesn't inject to 3D-Xpoint (that has its own injection method, but
it is annoying to have to juggle two methods).
3) Hard/impossible to inject into SGX memory (because BIOS is untrusted
and isn't allowed to do a store to push the poison data to DDR).

-Tony