Re: WARNING in kvm_arch_vcpu_ioctl_run

From: Sean Christopherson
Date: Thu Mar 16 2023 - 15:17:51 EST


+LKML (lore isn't picking this up for some reason) and a real subject

On Thu, Mar 16, 2023, zhangjianguo (A) wrote:
> Hi all,
>
> Install the 6.3.0-rc1 kernel on the x86 server, and execute the https://syzkaller.appspot.com/text?tag=ReproC&x=14b34300880000 test case, the call trace appears.
>
> [ +0.000028] ------------[ cut here ]------------
> [ +0.000002] WARNING: CPU: 36 PID: 73250 at arch/x86/kvm/x86.c:11060 kvm_arch_vcpu_ioctl_run+0x482/0x4b0 [kvm]
> [ +0.000086] Modules linked in: openvswitch nf_conncount nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 libcrc32c ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter ip_tables intel_rapl_msr intel_rapl_common sb_edac x86_pkg_temp_thermal coretemp kvm_intel kvm irqbypass rapl intel_cstate ixgbe ses intel_uncore mei_me enclosure pcspkr i2c_i801 mdio sunrpc mei intel_pch_thermal i2c_smbus joydev lpc_ich dca sg acpi_power_meter drm vhost_net tun vhost fuse vhost_iotlb tap ext4 mbcache jbd2 sd_mod crct10dif_pclmul ipmi_si ahci crc32_pclmul crc32c_intel libahci ipmi_devintf ghash_clmulni_intel mpt3sas libata sha512_ssse3 ipmi_msghandler wdat_wdt raid_class scsi_transport_sas dm_mod br_netfilter bridge stp llc nvme nvme_core t10_pi crc64_rocksoft_generic crc64_rocksoft crc64 nbd
> [ +0.000077] CPU: 36 PID: 73250 Comm: run_vcpu_ioctrl Not tainted 6.3.0-rc1+ #2
> [ +0.000004] Hardware name: Huawei RH2288 V3/BC11HGSB0, BIOS 3.87 02/02/2018
> [ +0.000002] RIP: 0010:kvm_arch_vcpu_ioctl_run+0x482/0x4b0 [kvm]
> [ +0.000002] Call Trace:
> [ +0.000003] <TASK>
> [ +0.000003] kvm_vcpu_ioctl+0x279/0x680 [kvm]
> [ +0.000047] ? vfs_write+0x2c8/0x3d0
> [ +0.000006] __x64_sys_ioctl+0x8f/0xc0
> [ +0.000006] do_syscall_64+0x3f/0x90
> [ +0.000007] entry_SYSCALL_64_after_hwframe+0x72/0xdc
> [ +0.000002] </TASK>
> [ +0.000002] ---[ end trace 0000000000000000 ]---
>
> | } else {
> | WARN_ON_ONCE(vcpu->arch.pio.count);
> | WARN_ON_ONCE(vcpu->mmio_needed); // where the splat triggered
> | }

The splat occurs due to a longstanding (literally since KVM's inception) shortcoming
in KVM's uAPI. On an emulated MMIO write, KVM finishes the instruction before
exiting to userspace. This is necessary given how KVM's uAPI works, as outside
of REP string instructions, KVM doesn't provide a way to restart an instruction
that partially completed before the MMIO was encountered.

For the vast majority of _emulated_ instructions, this doesn't cause problems as
there is a single memory access, i.e. any exceptions on the instruction will
occur _before_ the MMIO write.

What's happening here is that a PUSHA triggers an MMIO write and then runs past
the segment limit, resulting in a #SS after the MMIO is queued. KVM injects the
#SS (well, tries to) and thus loses track of the MMIO, but never clears mmio_needed.
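
To make the sequence concrete, here is a minimal toy model of it. This is a sketch, not KVM code: struct toy_vcpu, toy_emulate_pusha() and toy_finish_emulation() are hypothetical names invented for illustration, and only the field names mmio_needed, mmio_is_write and have_exception mirror the real ones. It shows a multi-store instruction queueing an MMIO write, faulting on a later store, and leaving mmio_needed set unless the exception path clears it.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the few pieces of vCPU state involved. */
struct toy_vcpu {
	bool mmio_needed;	/* an MMIO access is queued for userspace */
	bool mmio_is_write;	/* the queued access is a write */
	bool have_exception;	/* the emulator hit a fault mid-instruction */
};

/*
 * Model an instruction with several stores (e.g. PUSHA): an early store
 * hits emulated MMIO and is queued, then a later store may fault.
 */
static void toy_emulate_pusha(struct toy_vcpu *v, bool later_store_faults)
{
	v->mmio_needed = true;
	v->mmio_is_write = true;
	if (later_store_faults)
		v->have_exception = true;
}

/*
 * Finish emulation.  Returns true if mmio_needed is still set when the
 * vCPU goes back to the guest, i.e. the state that trips the WARN on the
 * next run.  clear_on_exception selects whether the exception path drops
 * the queued write.
 */
static bool toy_finish_emulation(struct toy_vcpu *v, bool clear_on_exception)
{
	if (v->have_exception) {
		if (clear_on_exception)
			v->mmio_needed = false;
		v->have_exception = false;	/* exception "injected" */
		return v->mmio_needed;		/* no exit to userspace */
	}

	/* No fault: exit to userspace, which completes the write. */
	v->mmio_needed = false;
	return false;
}
```

With clear_on_exception false the model ends up with stale mmio_needed after the fault; with it true the queued write is dropped along with the flag.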

There's a second bug here that results in failed VM-Enter when KVM tries to inject
the #SS: KVM doesn't drop the error code when the vCPU is in Real Mode. This
too is a longstanding bug that has likely escaped notice because no real world code
runs in Real Mode _and_ gracefully handles exceptions.

My plan, pending testing, is to suppress the MMIO write + exception scenario since
the bug has been around for 15+ years without anyone noticing, let alone caring.
Fixing it properly would be a heck of a lot of complexity and code churn for no
real benefit.

And for the Real Mode exception bug, unless I'm missing something, the error code
can simply be suppressed when queueing the exception.
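
As a standalone illustration of that idea, the sketch below masks the "has error code" flag with the protected-mode check. The toy_* names and the bare CR0 parameter are assumptions made so the snippet compiles on its own; the real helper takes a vCPU pointer. The only architectural fact relied on is that CR0.PE (bit 0) distinguishes Protected Mode from Real Mode.

```c
#include <stdbool.h>
#include <stdint.h>

#define TOY_CR0_PE	0x1u	/* CR0.PE, bit 0: protection enable */

/* Hypothetical stand-in for a protected-mode check on a vCPU. */
static bool toy_is_protmode(uint32_t cr0)
{
	return (cr0 & TOY_CR0_PE) != 0;
}

/*
 * Decide whether a queued exception should carry an error code.
 * Real Mode exceptions do not report error codes, so the flag is
 * dropped whenever CR0.PE is clear.
 */
static bool toy_exception_has_error_code(uint32_t cr0, bool has_error)
{
	return has_error && toy_is_protmode(cr0);
}
```

An exception that never had an error code is unaffected in either mode; only the Real Mode case with has_error set changes.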

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 237c483b1230..b3bf3a0d74ab 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -646,6 +646,9 @@ static void kvm_multiple_exception(struct kvm_vcpu *vcpu,

kvm_make_request(KVM_REQ_EVENT, vcpu);

+ /* Real Mode exceptions do not report error codes. */
+ has_error &= is_protmode(vcpu);
+
/*
* If the exception is destined for L2 and isn't being reinjected,
* morph it to a VM-Exit if L1 wants to intercept the exception. A
@@ -8883,6 +8886,8 @@ int x86_emulate_instruction(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
}

if (ctxt->have_exception) {
+ WARN_ON_ONCE(vcpu->mmio_needed && !vcpu->mmio_is_write);
+ vcpu->mmio_needed = false;
r = 1;
inject_emulated_exception(vcpu);
} else if (vcpu->arch.pio.count) {