Re: [PATCH v2] KVM: x86: Do not return host topology information from KVM_GET_SUPPORTED_CPUID

From: Jim Mattson
Date: Wed Jan 25 2023 - 17:10:15 EST


On Wed, Jan 25, 2023 at 1:46 PM Paolo Bonzini <pbonzini@xxxxxxxxxx> wrote:
>
> On 1/25/23 17:47, Jim Mattson wrote:
> >> Part of the definition of the API is that you can take
> >> KVM_GET_SUPPORTED_CPUID and pass it to KVM_SET_CPUID2 for all vCPUs.
> >> Returning host topology information for a random host vCPU definitely
> >> violates the contract.
> >
> > You are attempting to rewrite history. Leaf 0xB was added to > KVM_GET_SUPPORTED_CPUID in commit 0771671749b5 ("KVM: Enhance guest
> > cpuid management"), and the only documentation of the
> > KVM_GET_SUPPORTED_CPUID ioctl at that time was in the commit message:
> >
> > - KVM_GET_SUPPORTED_CPUID: get all cpuid entries the host (and kvm)
> > supports
> >
> > [...] the intention was to return the
> > host topology information for the current logical processor.
>
> The handling of unknown features is so naive in that commit, that I
> don't think it is possible to read anything from the implementation; and
> it certainly should not be a paragon for a future-proof implementation
> of KVM_GET_SUPPORTED_CPUID.
>
> For example, it only hid _known_ CPUID leaves or features and passed the
> unknown ones through, which you'll agree is completely broken. It also
> didn't try to handle all leaves for which ECX might turn out to be
> significant---which happened for EAX=7 so the commit returns a wrong
> output for CPUID[EAX=7,ECX=0].EAX.
>
> In other words, the only reason it handles 0xB is because it was known
> to have subleaves.
>
> We can get more information about how userspace was intended to use it
> from the qemu-kvm fork, which at the time was practically the only KVM
> userspace. As of 2009 it was only checking a handful of leaves:
>
> https://git.kernel.org/pub/scm/virt/kvm/qemu-kvm.git/tree/target-i386/kvm.c?h=kvm-88#n133
>
> so shall we say that userspace is supposed to build each CPUID leaf one
> by one and use KVM_GET_SUPPORTED_CPUID2 for validation only? I think
> the first committed documentation agrees: "Userspace can use the
> information returned by this ioctl to construct cpuid information (for
> KVM_SET_CPUID2) that is consistent with hardware, kernel, and userspace
> capabilities, and with user requirements".
>
> However, that's the theory. "Do not break userspace" also involves
> looking at how userspace *really* uses the API, and make compromises to
> cater to those uses; which is different from rewriting history.
>
> And in practice, people basically stopped reading after "(for
> KVM_SET_CPUID2)".
>
> For example in kvmtool:
>
> kvm_cpuid->nent = MAX_KVM_CPUID_ENTRIES;
> if (ioctl(vcpu->kvm->sys_fd, KVM_GET_SUPPORTED_CPUID, kvm_cpuid) < 0)
> die_perror("KVM_GET_SUPPORTED_CPUID failed");
>
> filter_cpuid(kvm_cpuid, vcpu->cpu_id);
>
> if (ioctl(vcpu->vcpu_fd, KVM_SET_CPUID2, kvm_cpuid) < 0)
> die_perror("KVM_SET_CPUID2 failed");
>
> where filter_cpuid only does minor adjustments that do not include 0xB,
> 0x1F and 0x8000001E. The result is a topology that makes no sense if
> host #vCPUs != guest #vCPUs, and which doesn't include the correct APIC
> id in EDX.
>
> https://github.com/kvmtool/kvmtool/blob/5657dd3e48b41bc6db38fa657994bc0e030fd31f/x86/cpuid.c
>
>
> crosvm does optionally attempt to pass through leaves 0xB and 0x1F, but
> it fails to adjust the APIC id in EDX. On the other hand it also passes
> through 0x8000001E if ctx.cpu_config.host_cpu_topology is false,
> incorrectly. So on one hand this patch breaks host_cpu_topology ==
> true, on the other hand it fixes host_cpu_topology == false on AMD
> processors.
>
> https://github.com/google/crosvm/blob/cc79897fc0813ee8412e6395648593898962ec82/x86_64/src/cpuid.rs#L121
>
>
> The rust-vmm reference hypervisor adjusts the APIC id in EDX for 0xB but
> not for 0x1F. Apart from that it passes through the host topology
> leaves, again resulting in nonsensical topology for host #vCPUs != guest
> #vCPUs.
>
> https://github.com/rust-vmm/vmm-reference/blob/5cde58bc955afca8a180585a9f01c82d6277a755/src/vm-vcpu-ref/src/x86_64/cpuid.rs
>
>
> Firecracker, finally, ignores KVM_GET_SUPPORTED_CPUID's output for 0xb
> and 0x8000001E (good!) but fails to do the same for 0x1F, so this patch
> is again a fix of sorts---having all zeroes in 0x1F is better than
> having a value that is valid but inconsistent with 0xB.
>
> https://github.com/firecracker-microvm/firecracker/blob/cdf4fef3011c51206f178bdf21ececb87caa16c1/src/cpuid/src/transformer/intel.rs#L120
> https://github.com/firecracker-microvm/firecracker/blob/cdf4fef3011c51206f178bdf21ececb87caa16c1/src/cpuid/src/transformer/amd.rs#L88
>
>
> So basically the only open source userspace that is penalized (but not
> broken, and also partly fixed) by the patch is crosvm. QEMU doesn't
> care, while firecracker/kvmtool/vmm-reference are a net positive.
>
> Paolo

The topology leaves returned by KVM_GET_SUPPORTED_CPUID *for over a
decade* have been passed through unmodified from the host. They have
never made sense for KVM_SET_CPUID2, with the unlikely exception of a
whole-host VM.

Our VMM populates the topology of the guest CPUID table on its own, as
any VMM must. However, it uses the host topology (which
KVM_GET_SUPPORTED_CPUID has been providing pass-through *for over a
decade*) to see if the requested guest topology is possible.

Changing a long-established ABI in a way that breaks userspace
applications is a bad practice. I didn't think we, as a community, did
that. I didn't realize that we were only catering to open source
implementations here.