[PATCH RFC 0/1] Make vCPUs that are HLT state candidates for load balancing

From: Masanori Misono
Date: Wed May 26 2021 - 09:37:41 EST


Hi,

I observed performance degradation when running some parallel programs on a
VM that has (1) KVM_FEATURE_PV_UNHALT, (2) KVM_FEATURE_STEAL_TIME, and (3) a
multi-core topology. The benchmark results are shown at the bottom. An
example libvirt XML snippet for creating such a VM is:

```
[...]
<vcpu placement='static'>8</vcpu>
<cpu mode='host-model'>
  <topology sockets='1' cores='8' threads='1'/>
</cpu>
<qemu:commandline>
  <qemu:arg value='-cpu'/>
  <qemu:arg value='host,l3-cache=on,+kvm-pv-unhalt,+kvm-steal-time'/>
</qemu:commandline>
[...]
```
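
For completeness, one way to double-check from inside the guest that both
features are actually exposed is to read KVM's CPUID leaves directly. The
following is only a small userspace sketch (bit numbers as documented in
Documentation/virt/kvm/cpuid.rst; not part of the patch):

```
/*
 * Guest-side sanity check (sketch): verify that KVM_FEATURE_STEAL_TIME
 * (bit 5) and KVM_FEATURE_PV_UNHALT (bit 7) are exposed in KVM's feature
 * CPUID leaf 0x40000001.
 */
#include <stdio.h>
#include <string.h>
#include <cpuid.h>

int main(void)
{
	unsigned int eax, ebx, ecx, edx;
	char sig[13] = { 0 };

	/* Hypervisor vendor leaf; KVM reports the "KVMKVMKVM" signature. */
	__cpuid(0x40000000, eax, ebx, ecx, edx);
	memcpy(sig + 0, &ebx, 4);
	memcpy(sig + 4, &ecx, 4);
	memcpy(sig + 8, &edx, 4);
	if (strncmp(sig, "KVMKVMKVM", 9) != 0) {
		fprintf(stderr, "not running on KVM (signature: \"%s\")\n", sig);
		return 1;
	}

	/* KVM feature leaf. */
	__cpuid(0x40000001, eax, ebx, ecx, edx);
	printf("KVM_FEATURE_STEAL_TIME: %s\n", (eax & (1u << 5)) ? "yes" : "no");
	printf("KVM_FEATURE_PV_UNHALT:  %s\n", (eax & (1u << 7)) ? "yes" : "no");
	return 0;
}
```

On a guest created from the XML above, both lines should print "yes".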

I investigated the cause and found that the problem occurs as follows:

- vCPU1 schedules thread A, and vCPU2 schedules thread B. vCPU1 and vCPU2
share LLC.
- Thread A tries to acquire a lock but fails and goes to sleep (via
futex).
- vCPU1 becomes idle because there are no runnable threads and executes
HLT, which leads to a HLT VMEXIT (assuming idle=halt and KVM does not
disable HLT VMEXIT via KVM_CAP_X86_DISABLE_EXITS).
- KVM sets vCPU1's st->preempted to 1 in kvm_steal_time_set_preempted().
- Thread C wakes up on vCPU2. vCPU2 tries to do load balancing in
select_idle_core(). Although vCPU1 is idle, vCPU1 is not a candidate for
load balancing because vcpu_is_preempted(vCPU1) is true, and hence
available_idle_cpu(vCPU1) is false (see the abridged helpers after this
list).
- As a result, both thread B and thread C stay in the vCPU2's runqueue, and
vCPU1 is not utilized.
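
For reference, the two helpers involved look roughly like this (lightly
abridged from my reading of kernel/sched/core.c and arch/x86/kernel/kvm.c;
on x86-64 the guest-side check is actually hand-written assembly with the
same logic):

```
/* kernel/sched/core.c (abridged): an idle CPU only counts as available
 * if the paravirt layer does not report it as preempted. */
int available_idle_cpu(int cpu)
{
	if (!idle_cpu(cpu))
		return 0;

	if (vcpu_is_preempted(cpu))
		return 0;

	return 1;
}

/* arch/x86/kernel/kvm.c (abridged): the KVM guest reads back the
 * steal_time.preempted flag that the host sets in
 * kvm_steal_time_set_preempted(). */
__visible bool __kvm_vcpu_is_preempted(long cpu)
{
	struct kvm_steal_time *src = &per_cpu(steal_time, cpu);

	return !!(src->preempted & KVM_VCPU_PREEMPTED);
}
```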

The patch changes kvm_arch_vcpu_put() so that it does not set st->preempted
to 1 when a vCPU exits due to HLT. As a result, vcpu_is_preempted(vCPU)
returns 0, and the vCPU becomes a candidate for CFS load balancing.
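
As a minimal sketch of the idea (not the actual patch hunk; the
last_exit_was_hlt flag below is a hypothetical stand-in for whatever
bookkeeping the real patch does in arch/x86/kvm/x86.c):

```
/*
 * Minimal sketch of the idea, not the actual patch hunk.  When the vCPU
 * left guest mode because it executed HLT (rather than being scheduled
 * out involuntarily), skip marking it as preempted in steal_time, so the
 * guest's vcpu_is_preempted() keeps returning false for a merely-halted
 * vCPU.  "last_exit_was_hlt" is a hypothetical flag.
 */
static void kvm_steal_time_maybe_set_preempted(struct kvm_vcpu *vcpu)
{
	if (vcpu->arch.last_exit_was_hlt)	/* hypothetical flag */
		return;

	kvm_steal_time_set_preempted(vcpu);	/* existing helper in x86.c */
}
```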

The following are parts of the benchmark results of NPB-OMP
(https://www.nas.nasa.gov/publications/npb.html), a suite of parallel
computing programs. My machine has two NUMA nodes, and each CPU has 24
cores (Intel Xeon Platinum 8160, hyper-threading disabled). I created a VM
with 48 vCPUs, each pinned to its corresponding pCPU. I also created
virtual NUMA nodes so that the guest topology is as close to the host's as
possible. Values in the table are execution times in seconds (lower is
better).

| environment \ benchmark name |   lu.C |  mg.C |  bt.C |  cg.C |
|------------------------------+--------+-------+-------+-------|
| host (Linux v5.13-rc3)       |  50.67 | 14.67 | 54.77 | 20.08 |
| VM (sockets=48, cores=1)     |  51.37 | 14.88 | 55.99 | 20.05 |
| VM (sockets=2, cores=24)     | 170.12 | 23.86 | 75.95 | 40.15 |
| w/ this patch                |  48.92 | 14.95 | 55.23 | 20.09 |


vcpu_is_preempted() is also used in PV spinlock and other locking
implementations to mitigate lock holder preemption problems. A vCPU holding
a lock does not execute HLT, so I think this patch doesn't affect them.
However, the pCPU may be running a host thread that has a higher priority
than the vCPU thread; in that case, vcpu_is_preempted() should ideally
return 1 even though the vCPU is merely halted. I guess such an
implementation would be a bit complicated, so I wonder if this patch
approach is acceptable.
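
For context, this is the kind of lock holder preemption check I mean,
roughly as it appears in mutex_spin_on_owner() in kernel/locking/mutex.c;
a vCPU sitting in HLT cannot be the lock owner being spun on, so skipping
the preempted flag for HLT exits should not change this path:

```
/* kernel/locking/mutex.c (abridged): optimistic spinning gives up when
 * the lock owner's vCPU is reported as preempted. */
while (__mutex_owner(lock) == owner) {
	if (!owner->on_cpu || need_resched() ||
	    vcpu_is_preempted(task_cpu(owner))) {
		ret = false;
		break;
	}

	cpu_relax();
}
```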

Thanks,

Masanori Misono (1):
KVM: x86: Don't set preempted when vCPU does HLT VMEXIT

arch/x86/kvm/x86.c | 15 +++++++++++----
1 file changed, 11 insertions(+), 4 deletions(-)


base-commit: c4681547bcce777daf576925a966ffa824edd09d
--
2.31.1