Re: Candidate Linux ABI for Intel AMX and hypothetical new related features

From: Andy Lutomirski
Date: Thu Apr 15 2021 - 12:24:52 EST


On Wed, Apr 14, 2021 at 2:48 PM Len Brown <lenb@xxxxxxxxxx> wrote:
>

>
> > Then I take the transition penalty into and out of AMX code (I'll
> > believe there is no penalty when I see it -- we've had a penalty with
> > VEX and with AVX-512) and my program runs *slower*.
>
> If you have a clear definition of what "transition penalty" is, please share it.

Given the generally awful state of Intel's documentation about these
issues, it's quite hard to tell for real. But here are some examples.

VEX: Figures 11-1 ("AVX-SSE Transitions in the Broadwell, and Prior
Generation Microarchitectures") and 11-2 ("AVX-SSE Transitions in the
Skylake Microarchitecture"). We *still* have a performance regression
in the upstream kernel because, despite all common sense, the CPUs
consider LDMXCSR to be an SSE instruction and VLDMXCSR to be an AVX
instruction despite the fact that neither one of them touch the XMM or
YMM state at all.

AVX-512:

https://lore.kernel.org/linux-crypto/CALCETrU06cuvUF5NDSm8--dy3dOkxYQ88cGWaakOQUE4Vkz88w@xxxxxxxxxxxxxx/

https://travisdowns.github.io/blog/2020/01/17/avxfreq1.html





>
> Lacking one, I'll assume you are referring to the
> impact on turbo frequency of using AMX hardware?
>
> Again...
>
> On the hardware that supports AMX, there is zero impact on frequency
> due to the presence of AMX state, whether modified or unmodified.
>
> We resolved on another thread that Linux will never allow entry
> into idle with modified AMX state, and so AMX will have zero impact
> on the ability of the process to enter deep power-saving C-states.
>
> It is true that AMX activity is considered when determining max turbo.
> (as it must be)
> However, the *release* of the turbo credits consumed by AMX is
> "several orders of magnitude" faster on this generation
> than it was for AVX-512 on pre-AMX hardware.

What is the actual impact of a trivial function that initializes the
tile config, does one tiny math op, and then does TILERELEASE?

> Yes, the proposal, and the working patch set on the list, context
> switches XFD -- which is exactly what that hardware was designed to do.
> If the old and new tasks have the same value of XFD, the MSR write is skipped.
>
> I'm not aware of any serious proposal to context-switch XCR0,
> as it would break the current programming model, where XCR0
> advertises what the OS supports. It would also impact performance,
> as every write to XCR0 necessarily provokes a VMEXIT.

You're arguing against a nonsensical straw man.

In the patches, *as submitted*, if you trip the XFD #NM *once* and you
are the only thread on the system to do so, you will eat the cost of a
WRMSR on every subsequent context switch. This is not free. If we
use XCR0 (I'm not saying we will -- I'm just mentioning at a
possibility), then the penalty is presumably worse due to the VMX
issue.