Re: recalibrating x86 TSC during suspend/resume

From: Paolo Bonzini
Date: Fri Feb 22 2019 - 07:31:26 EST


On 22/02/19 12:44, Thomas Gleixner wrote:
>> The specific use case I have is a workload within VMs that makes heavy
>> use of TSC. The kernel is booted with 'clocksource=tsc highres=off nohz=off'
>> because only this clocksource gives enough granularity. The default
>> paravirtualized clock will return the same values via
>> clock_gettime(CLOCK_MONOTONIC) if the timespan between two calls is too
>> short. This does not happen with 'clocksource=tsc'.

This shouldn't happen. clock_gettime(CLOCK_MONOTONIC) should be
monotonically increasing. Do you have a testcase?
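
Something along these lines would do as a testcase. It is only a
sketch: check_monotonic() and struct mono_stats are names made up for
this mail, not anything in the kernel tree. It reads CLOCK_MONOTONIC
back to back and counts how often the value repeats or goes backwards.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdint.h>
#include <time.h>

/* Hypothetical result type: how often consecutive reads were odd. */
struct mono_stats {
    uint64_t repeats;   /* two consecutive reads returned the same value */
    uint64_t backwards; /* a later read returned a smaller value */
};

/* Flatten a timespec into nanoseconds. */
int64_t ts_ns(const struct timespec *ts)
{
    return (int64_t)ts->tv_sec * 1000000000LL + ts->tv_nsec;
}

/* Read CLOCK_MONOTONIC 'iterations' times back to back. */
struct mono_stats check_monotonic(uint64_t iterations)
{
    struct mono_stats st = { 0, 0 };
    struct timespec prev, cur;
    int rc = clock_gettime(CLOCK_MONOTONIC, &prev);

    assert(rc == 0);
    for (uint64_t i = 0; i < iterations; i++) {
        rc = clock_gettime(CLOCK_MONOTONIC, &cur);
        assert(rc == 0);
        int64_t delta = ts_ns(&cur) - ts_ns(&prev);
        if (delta == 0)
            st.repeats++;
        else if (delta < 0)
            st.backwards++;
        prev = cur;
    }
    (void)rc;
    return st;
}
```

Running check_monotonic() with a few million iterations inside the
guest, once with kvmclock and once with clocksource=tsc, and comparing
the two counters would show whether the behavior can be reproduced.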

The KVM clocksource is high-resolution and also TSC-based. The
difference is that it performs two multiplications instead of one. The
first uses TSC parameters from the host. The second, which is the one
in do_hres() in arch/x86/entry/vdso/vclock_gettime.c, will have a
1:1 multiplier (excluding adjtime slewing) because kvmclock already
returns nanoseconds.
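
Roughly, the two stages look like the sketch below. The function names
pvclock_scale() and clocksource_scale() are invented for illustration;
tsc_to_system_mul and tsc_shift are the fields the host publishes in
the pvclock structure. The real code also adds a base time and takes
care of sequence counters, which is left out here, and the 128-bit
type is a GCC/Clang extension.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Stage 1 (kvmclock): TSC delta -> nanoseconds, using the host-provided
 * 32.32 fixed-point multiplier and a signed pre-shift.
 */
uint64_t pvclock_scale(uint64_t tsc_delta, uint32_t tsc_to_system_mul,
                       int8_t tsc_shift)
{
    assert(tsc_shift > -64 && tsc_shift < 64);
    if (tsc_shift < 0)
        tsc_delta >>= -tsc_shift;
    else
        tsc_delta <<= tsc_shift;
    /* 64x32 multiply, keep the integer part of the 32.32 result. */
    return (uint64_t)(((unsigned __int128)tsc_delta * tsc_to_system_mul) >> 32);
}

/*
 * Stage 2 (generic timekeeping): clocksource units -> nanoseconds.
 * For kvmclock the units are already ns, so mult == 1 << shift
 * unless the clock is being slewed.
 */
uint64_t clocksource_scale(uint64_t cycles, uint32_t mult, uint32_t shift)
{
    assert(shift < 64);
    return (uint64_t)(((unsigned __int128)cycles * mult) >> shift);
}
```

With a 2 GHz host TSC the first stage would use a multiplier of 0.5
(0x80000000 in 32.32 fixed point), and the second stage is the identity
as long as mult equals 1 << shift.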

> Newer Intels support TSC scaling for VMX, which could solve the problem. It
> affects TSC readout by:
>
> TSC = (read(HWTSC) * multiplier) >> 48
>
> So you can standardize on a TSC frequency across a fleet. Not sure when
> that was introduced and no idea whether it's available on AMD.

It's Skylake (server parts only) or newer. AMD instead has had it
(almost) forever. QEMU 2.6 or newer will use it automatically across
live migration, if available.
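
As a worked example of the formula quoted above (a sketch only: the
helper names tsc_multiplier() and scale_tsc() are made up, the 48
fractional bits are taken from Thomas's formula, and unsigned __int128
is a GCC/Clang extension):

```c
#include <assert.h>
#include <stdint.h>

/* Number of fractional bits in the multiplier, per the formula above. */
#define TSC_SCALE_FRAC_BITS 48

/* Multiplier that makes a host TSC at host_khz appear to tick at guest_khz. */
uint64_t tsc_multiplier(uint64_t guest_khz, uint64_t host_khz)
{
    assert(host_khz != 0);
    return (uint64_t)(((unsigned __int128)guest_khz << TSC_SCALE_FRAC_BITS)
                      / host_khz);
}

/* TSC = (read(HWTSC) * multiplier) >> 48, with a 128-bit intermediate. */
uint64_t scale_tsc(uint64_t hw_tsc, uint64_t multiplier)
{
    return (uint64_t)(((unsigned __int128)hw_tsc * multiplier)
                      >> TSC_SCALE_FRAC_BITS);
}
```

For a 3 GHz host and a guest that should see 1.5 GHz, the multiplier
is exactly 1 << 47, so 3000 host cycles read as 1500 guest cycles.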

> For a software solution we could try the following:
>
> 1) Provide the raw TSC frequency of the host to the guest in some magic
> software defined MSR or CPUID. If there is an existing mechanism, use
> that.

This shouldn't be needed for two reasons:

1) you could also use kvmclock's provided mult/shift

2) I am not convinced that kvmclock has the behavior that Olaf mentions,
and if it does it would be a bug.

Paolo