Re: [PATCH] KVM: x86: use TPAUSE to replace PAUSE in halt polling

From: Sean Christopherson
Date: Wed Aug 24 2022 - 11:26:52 EST


On Wed, Aug 24, 2022, Dapeng Mi wrote:
> TPAUSE is a new instruction on Intel processors which can instruct
> processor enters a power/performance optimized state. Halt polling
> uses PAUSE instruction to wait vCPU is waked up. The polling time
> could be long and cause extra power consumption in some cases.
>
> Use TPAUSE to replace the PAUSE instruction in halt polling to get
> a better power saving and performance.

Better power savings, yes. Better performance? Not necessarily. Using TPAUSE
for a "successful" halt poll is likely to yield _worse_ performance from the
vCPU's perspective due to the increased latency.

> Signed-off-by: Dapeng Mi <dapeng1.mi@xxxxxxxxx>
> ---
> drivers/cpuidle/poll_state.c | 3 ++-
> include/linux/kvm_host.h | 20 ++++++++++++++++++++
> virt/kvm/kvm_main.c | 2 +-
> 3 files changed, 23 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/cpuidle/poll_state.c b/drivers/cpuidle/poll_state.c
> index f7e83613ae94..51ec333cbf80 100644
> --- a/drivers/cpuidle/poll_state.c
> +++ b/drivers/cpuidle/poll_state.c
> @@ -7,6 +7,7 @@
> #include <linux/sched.h>
> #include <linux/sched/clock.h>
> #include <linux/sched/idle.h>
> +#include <linux/kvm_host.h>
>
> #define POLL_IDLE_RELAX_COUNT 200
>
> @@ -25,7 +26,7 @@ static int __cpuidle poll_idle(struct cpuidle_device *dev,
> limit = cpuidle_poll_time(drv, dev);
>
> while (!need_resched()) {
> - cpu_relax();
> + kvm_cpu_poll_pause(limit);

poll_idle() absolutely should not be calling into KVM code.

> if (loop_count++ < POLL_IDLE_RELAX_COUNT)
> continue;
>
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index f4519d3689e1..810e749949b7 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -35,6 +35,7 @@
> #include <linux/interval_tree.h>
> #include <linux/rbtree.h>
> #include <linux/xarray.h>
> +#include <linux/delay.h>
> #include <asm/signal.h>
>
> #include <linux/kvm.h>
> @@ -2247,6 +2248,25 @@ static inline void kvm_handle_signal_exit(struct kvm_vcpu *vcpu)
> }
> #endif /* CONFIG_KVM_XFER_TO_GUEST_WORK */
>
> +/*
> + * This function is intended to replace the cpu_relax function in
> + * halt polling. If TPAUSE instruction is supported, use TPAUSE
> + * instead fo PAUSE to get better power saving and performance.
> + * Selecting 1 us is a compromise between scheduling latency and
> + * power saving time.
> + */
> +static inline void kvm_cpu_poll_pause(u64 timeout_ns)
> +{
> +#ifdef CONFIG_X86

This is not preferred the way to insert arch-specific behavior into common KVM code.
Assuming the goal is to avoid a function call, use an #ifndef here and then #define
the flag in x86's kvm_host.h, e.g.

#ifndef CONFIG_HAVE_KVM_ARCH_HALT_POLL_PAUSE
static inline kvm_cpu_halt_poll_pause(u64 timeout_ns)
{
cpu_relax();
}
#endif

It's not obvious that we need to avoid a call here though, in which case a

__weak void kvm_arch_cpu_halt_poll_pause(struct kvm *kvm)
{

}

with an x86 implementation will suffice.


> + if (static_cpu_has(X86_FEATURE_WAITPKG) && timeout_ns > 1000)
> + udelay(1);

This is far too arbitrary. Wake events from other vCPU are not necessarily
accompanied by an IRQ, which means that delaying for 1us may really truly delay
for 1us before detecting a pending wake event.

If this is something we want to utilize in KVM, it should be controllable by
userspace, probably via module param, and likely off by default.

E.g.

unsigned int halt_poll_tpause_ns;

and then

if (timeout_ns >= halt_poll_tpause_ns)
udelay(halt_poll_tpause_ns);

with halt_poll_tpause_ns zeroed out during setup if TPAUSE isn't supported.

I say "if", because I think this needs to come with performance numbers to show
the impact on guest latency so that KVM and its users can make an informed decision.
And if it's unlikely that anyone will ever want to enable TPAUSE for halt polling,
then it's not worth the extra complexity in KVM.

> + else
> + cpu_relax();
> +#else
> + cpu_relax();
> +#endif
> +}
> +
> /*
> * This defines how many reserved entries we want to keep before we
> * kick the vcpu to the userspace to avoid dirty ring full. This
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 584a5bab3af3..4afa776d21bd 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -3510,7 +3510,7 @@ void kvm_vcpu_halt(struct kvm_vcpu *vcpu)
> */
> if (kvm_vcpu_check_block(vcpu) < 0)
> goto out;
> - cpu_relax();
> + kvm_cpu_poll_pause(vcpu->halt_poll_ns);

This is wrong, vcpu->halt_poll_ns is the total poll time, not the time remaining.
E.g. if the max poll time is 1001 ns, and KVM has already waited for 1000 ns, then
udelay(1) will cause KVM to wait for ~2000ns total. There's always going to be
some amount of overrun, but overrun by a few ns is quite different than overrun
by a few thousand ns.

> poll_end = cur = ktime_get();
> } while (kvm_vcpu_can_poll(cur, stop));
> }
> --
> 2.34.1
>