Re: [PATCH v1] Revert "KVM: x86: zero kvmclock_offset when vcpu0 initializes kvmclock system MSR"

From: Marcelo Tosatti
Date: Tue Sep 22 2015 - 15:04:32 EST


On Tue, Sep 22, 2015 at 06:33:46PM +0200, Radim KrÄmÃÅ wrote:
> PVCLOCK_COUNTS_FROM_ZERO broke ABI and (at least) three things with it.
> All problems stem from repeated writes to MSR_KVM_SYSTEM_TIME(_NEW).
> The reverted patch treated the MSR write as a one-shot initializer:
> any write from VCPU 0 would reset system_time.
>
> And this is what broke for Linux guests:
> * Onlining/hotplugging of VCPU 0
>
> VCPU has to assign an address to KVM clock before use, which is done
> with MSR_KVM_SYSTEM_TIME_NEW. Linux has an idea that time should not
> jump backward, so any `sleep` won't return before |system_time at the
> point of offline| elapses since the online. Be sure to run ntp.
>
> * S3 and S4 resume
>
> If you don't have PVCLOCK_TSC_STABLE_BIT in Linux, resume will freeze
> for |system_time at the point of suspend|, because pvclock ensures
> monoticity and kvmclock did not think about it.
>
> If you have stable clock, execution will resume immediately, but
> restoring KVM clock writes to the MSR and dmesg starts to count from
> zero. It's better than the onlining, but not what we want either.
>
> * Boot of SLES 10 guest
>
> SLES 10 has a custom implementation of kvm clock that calls
> MSR_KVM_SYSTEM_TIME before every read to enhance precision ...
> Two things are happening at the same time:
> 1) The guest periodically receives an interrupt that is handled by
> main_timer_handler():
> a) get time using the kvm clock:
> 1) write the address to MSR_KVM_SYSTEM_TIME
> 2) read tsc and pvclock (tsc_offset, system_time)
> 3) time = tsc - tsc_offset + system_time
> b) compute time since the last main_timer_handler()
> c) bump jiffies if enough time has elapsed
> 2) the guest wants to calibrate loops per jiffy [1]:
> a) read tsc
> b) loop till jiffies increase
> c) compute lpj
>
> Because (1a1) always resets the system_time to 0, we read the same
> value over and over so the condition for (1c) is never true and
> jiffies remain constant. A hang happens in (2b) as it is the first
> place that depends on jiffies.
>
>
> We could make hypervisor workaround for this, but that is just asking
> for more trouble. Luckily, reverting does not break to guests that
> learned about PVCLOCK_COUNTS_FROM_ZERO, in new ways.
> Only 4.2+ guests with NOHZ_FULL wanted PVCLOCK_COUNTS_FROM_ZERO, which
> is a good trade-off for not regressing.
>
> This reverts commit b7e60c5aedd2b63f16ef06fde4f81ca032211bc5.
> And adds a note to the definition of PVCLOCK_COUNTS_FROM_ZERO.
>
> Cc: stable@xxxxxxxxxxxxxxx
> Signed-off-by: Radim KrÄmÃÅ <rkrcmar@xxxxxxxxxx>
> ---
> v1: Extended commit message based on a discussion with Marcelo
>
> arch/x86/include/asm/pvclock-abi.h | 1 +
> arch/x86/kvm/x86.c | 4 ----
> 2 files changed, 1 insertion(+), 4 deletions(-)
>
> diff --git a/arch/x86/include/asm/pvclock-abi.h b/arch/x86/include/asm/pvclock-abi.h
> index 655e07a48f6c..67f08230103a 100644
> --- a/arch/x86/include/asm/pvclock-abi.h
> +++ b/arch/x86/include/asm/pvclock-abi.h
> @@ -41,6 +41,7 @@ struct pvclock_wall_clock {
>
> #define PVCLOCK_TSC_STABLE_BIT (1 << 0)
> #define PVCLOCK_GUEST_STOPPED (1 << 1)
> +/* PVCLOCK_COUNTS_FROM_ZERO broke ABI and can't be used anymore. */
> #define PVCLOCK_COUNTS_FROM_ZERO (1 << 2)
> #endif /* __ASSEMBLY__ */
> #endif /* _ASM_X86_PVCLOCK_ABI_H */
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 4bca39f0fdb3..71731994d897 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -1711,8 +1711,6 @@ static int kvm_guest_time_update(struct kvm_vcpu *v)
> vcpu->pvclock_set_guest_stopped_request = false;
> }
>
> - pvclock_flags |= PVCLOCK_COUNTS_FROM_ZERO;
> -
> /* If the host uses TSC clocksource, then it is stable */
> if (use_master_clock)
> pvclock_flags |= PVCLOCK_TSC_STABLE_BIT;
> @@ -2010,8 +2008,6 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
> &vcpu->requests);
>
> ka->boot_vcpu_runs_old_kvmclock = tmp;
> -
> - ka->kvmclock_offset = -get_kernel_ns();
> }
>
> vcpu->arch.time = data;
> --
> 2.5.3

NACK, please use original patchset.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/