Re: [PATCH v3 1/2] x86/fpu: track AVX-512 usage of tasks

From: Dave Hansen
Date: Thu Nov 15 2018 - 10:40:25 EST


On 11/14/18 3:00 PM, Aubrey Li wrote:
> AVX-512 component has 3 states, only Hi16_ZMM state causes notable
> frequency drop. Add per task Hi16_ZMM state tracking to context switch.

Just curious, but is there any public documentation of this? It seems
really odd to me that something using the same AVX-512 instructions on
some low-numbered registers would behave differently than the same
instructions on some high-numbered registers. I'm not saying this is
wrong, but it's certainly counter-intuitive and I think that begs for
some more explanation.

> The tracking turns on the usage flag immediately, but requires 3
> consecutive context switches with no usage to clear it. This decay is
> required because of AVX-512 using tasks could set Hi16_ZMM state back
> to the init state themselves.

It would be nice to not assume that folks reading this changelog know
what XSAVE 'init states' are. In fact, that comment you have in the
function below would be great here, but probably shouldn't be in the
comment.

I would say it even more strongly than that: Part of the best practices
for using AVX-512 is to use the VZEROUPPER instruction to zero out some
state when the AVX-512 operation is finished. Unlike all the other FPU
and AVX state before it, this means that the Hi16_ZMM is expected to
frequently transition back to the "init state" in normal use. This
might cause this detection mechanism to frequently miss tasks that
actually use AVX-512. To fix that, add a decay.

> Signed-off-by: Aubrey Li <aubrey.li@xxxxxxxxxxxxxxx>
> Cc: Peter Zijlstra <peterz@xxxxxxxxxxxxx>
> Cc: Andi Kleen <ak@xxxxxxxxxxxxxxx>
> Cc: Tim Chen <tim.c.chen@xxxxxxxxxxxxxxx>
> Cc: Dave Hansen <dave.hansen@xxxxxxxxx>
> Cc: Arjan van de Ven <arjan@xxxxxxxxxxxxxxx>
> ---
> arch/x86/include/asm/fpu/internal.h | 26 ++++++++++++++++++++++++++
> arch/x86/include/asm/fpu/types.h | 9 +++++++++
> 2 files changed, 35 insertions(+)
>
> diff --git a/arch/x86/include/asm/fpu/internal.h b/arch/x86/include/asm/fpu/internal.h
> index a38bf5a..f382449 100644
> --- a/arch/x86/include/asm/fpu/internal.h
> +++ b/arch/x86/include/asm/fpu/internal.h
> @@ -275,6 +275,31 @@ static inline void copy_fxregs_to_kernel(struct fpu *fpu)
> : "D" (st), "m" (*st), "a" (lmask), "d" (hmask) \
> : "memory")
>
> +#define HI16ZMM_STATE_DECAY_COUNT 3
> +/*
> + * This function is called during context switch to update Hi16_ZMM state
> + */
> +static inline void update_hi16zmm_state(struct fpu *fpu)
> +{
> + /*
> + * XSAVE header contains a state-component bitmap(xfeatures),
> + * which allows software to discover the state of the init
> + * optimization used by XSAVEOPT and XSAVES.

I don't think we need the XSAVE background here. Can you put this in
the changelog?

> + * Hi16_ZMM state(one state of AVX-512 component) is tracked here
> + * because its usage could cause notable core turbo frequency drop.

I'd leave just this part of the comment.

> + * AVX512-using tasks could set Hi16_ZMM state back to the init
> + * state themselves. Thus, this tracking mechanism can miss.

Can you make this a stronger statement, just like the changelog?

> + * The decay usage ensures that false-negatives do not immediately
> + * make a task be considered as not using Hi16_ZMM registers.
> + */

To ensure that false-negatives do not immediately show up, decay the
usage count over time.


> + *
> + * Records the usage of the upper 16 AVX512 registers: ZMM16-ZMM31.
> + * A value of non-zero is used to indicate whether there is valid
> + * state in these AVX512 registers.
> + */

>
> /*
> + * @hi16zmm_usage:
> + *
> + * Records the usage of the upper 16 AVX512 registers: ZMM16-ZMM31.
> + * A value of non-zero is used to indicate whether there is valid
> + * state in these AVX512 registers.
> + */
> + unsigned char hi16zmm_usage;
> +

Nit: With the decay, this does not indicate register state. It
indicates whether the registers recently had state.