Re: [PATCH v2] KVM: x86: Do not return host topology information from KVM_GET_SUPPORTED_CPUID

From: Jim Mattson
Date: Tue Jan 24 2023 - 18:17:00 EST


On Thu, Oct 27, 2022 at 2:21 AM Paolo Bonzini <pbonzini@xxxxxxxxxx> wrote:
>
> Passing the host topology to the guest is almost certainly wrong
> and will confuse the scheduler. In addition, several fields of
> these CPUID leaves vary on each processor; it is simply impossible to
> return the right values from KVM_GET_SUPPORTED_CPUID in such a way that
> they can be passed to KVM_SET_CPUID2.
>
> The values that will most likely prevent confusion are all zeroes.
> Userspace will have to override it anyway if it wishes to present a
> specific topology to the guest.
>
> Cc: stable@xxxxxxxxxxxxxxx
> Signed-off-by: Paolo Bonzini <pbonzini@xxxxxxxxxx>
> ---
> Documentation/virt/kvm/api.rst | 14 ++++++++++++++
> arch/x86/kvm/cpuid.c | 32 ++++++++++++++++----------------
> 2 files changed, 30 insertions(+), 16 deletions(-)
>
> diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
> index eee9f857a986..20f4f6b302ff 100644
> --- a/Documentation/virt/kvm/api.rst
> +++ b/Documentation/virt/kvm/api.rst
> @@ -8249,6 +8249,20 @@ CPU[EAX=1]:ECX[24] (TSC_DEADLINE) is not reported by ``KVM_GET_SUPPORTED_CPUID``
> It can be enabled if ``KVM_CAP_TSC_DEADLINE_TIMER`` is present and the kernel
> has enabled in-kernel emulation of the local APIC.
>
> +CPU topology
> +~~~~~~~~~~~~
> +
> +Several CPUID values include topology information for the host CPU:
> +0x0b and 0x1f for Intel systems, 0x8000001e for AMD systems. Different
> +versions of KVM return different values for this information and userspace
> +should not rely on it. Currently they return all zeroes.
> +
> +If userspace wishes to set up a guest topology, it should be careful that
> +the values of these three leaves differ for each CPU. In particular,
> +the APIC ID is found in EDX for all subleaves of 0x0b and 0x1f, and in EAX
> +for 0x8000001e; the latter also encodes the core id and node id in bits
> +7:0 of EBX and ECX respectively.
> +
> Obsolete ioctls and capabilities
> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>
> diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
> index 0810e93cbedc..164bfb7e7a16 100644
> --- a/arch/x86/kvm/cpuid.c
> +++ b/arch/x86/kvm/cpuid.c
> @@ -759,16 +759,22 @@ struct kvm_cpuid_array {
> int nent;
> };
>
> +static struct kvm_cpuid_entry2 *get_next_cpuid(struct kvm_cpuid_array *array)
> +{
> + if (array->nent >= array->maxnent)
> + return NULL;
> +
> + return &array->entries[array->nent++];
> +}
> +
> static struct kvm_cpuid_entry2 *do_host_cpuid(struct kvm_cpuid_array *array,
> u32 function, u32 index)
> {
> - struct kvm_cpuid_entry2 *entry;
> + struct kvm_cpuid_entry2 *entry = get_next_cpuid(array);
>
> - if (array->nent >= array->maxnent)
> + if (!entry)
> return NULL;
>
> - entry = &array->entries[array->nent++];
> -
> memset(entry, 0, sizeof(*entry));
> entry->function = function;
> entry->index = index;
> @@ -945,22 +951,13 @@ static inline int __do_cpuid_func(struct kvm_cpuid_array *array, u32 function)
> entry->edx = edx.full;
> break;
> }
> - /*
> - * Per Intel's SDM, the 0x1f is a superset of 0xb,
> - * thus they can be handled by common code.
> - */
> case 0x1f:
> case 0xb:
> /*
> - * Populate entries until the level type (ECX[15:8]) of the
> - * previous entry is zero. Note, CPUID EAX.{0x1f,0xb}.0 is
> - * the starting entry, filled by the primary do_host_cpuid().
> + * No topology; a valid topology is indicated by the presence
> + * of subleaf 1.
> */
> - for (i = 1; entry->ecx & 0xff00; ++i) {
> - entry = do_host_cpuid(array, function, i);
> - if (!entry)
> - goto out;
> - }
> + entry->eax = entry->ebx = entry->ecx = 0;
> break;
> case 0xd: {
> u64 permitted_xcr0 = kvm_caps.supported_xcr0 & xstate_get_guest_group_perm();
> @@ -1193,6 +1190,9 @@ static inline int __do_cpuid_func(struct kvm_cpuid_array *array, u32 function)
> entry->ebx = entry->ecx = entry->edx = 0;
> break;
> case 0x8000001e:
> + /* Do not return host topology information. */
> + entry->eax = entry->ebx = entry->ecx = 0;
> + entry->edx = 0; /* reserved */
> break;
> case 0x8000001F:
> if (!kvm_cpu_cap_has(X86_FEATURE_SEV)) {
> --
> 2.31.1
>

This is a userspace ABI change that breaks existing hypervisors.
Please don't do this. Userspace ABIs are supposed to be inviolate.