Re: [PATCH 5/5] add documentation about kvmclock

From: Randy Dunlap
Date: Thu Apr 15 2010 - 15:29:23 EST


On Thu, 15 Apr 2010 14:37:28 -0400 Glauber Costa wrote:

> This patch adds a new file, kvm/kvmclock.txt, describing
> the mechanism we use in kvmclock.
>
> Signed-off-by: Glauber Costa <glommer@xxxxxxxxxx>
> ---
> Documentation/kvm/kvmclock.txt | 138 ++++++++++++++++++++++++++++++++++++++++
> 1 files changed, 138 insertions(+), 0 deletions(-)
> create mode 100644 Documentation/kvm/kvmclock.txt
>
> diff --git a/Documentation/kvm/kvmclock.txt b/Documentation/kvm/kvmclock.txt
> new file mode 100644
> index 0000000..21008bb
> --- /dev/null
> +++ b/Documentation/kvm/kvmclock.txt
> @@ -0,0 +1,138 @@
> +KVM Paravirtual Clocksource driver
> +Glauber Costa, Red Hat Inc.
> +==================================
> +
> +1. General Description
> +=======================
> +
...
> +
> +2. kvmclock basics
> +===========================
> +
> +When supported by the hypervisor, guests can register a memory page
> +to contain kvmclock data. This page has to be present in guest's address space
> +throughout its whole life. The hypervisor continues to write to it until it is
> +explicitly disabled or the guest is turned off.
> +
> +2.1 kvmclock availability
> +-------------------------
> +
> +Guests that want to take advantage of kvmclock should first check its
> +availability through cpuid.
> +
> +kvm features are presented to the guest in leaf 0x40000001. Bit 3 indicates
> +the present of kvmclock. Bit 0 indicates that kvmclock is present, but the

presence
but it's confusing. Is it bit 3 or bit 0? They seem to indicate the same thing.

> +old MSR set must be used. See section 2.3 for details.

"old MSR set": what does this mean?

> +
> +2.2 kvmclock functionality
> +--------------------------
> +
> +Two MSRs are provided by the hypervisor, controlling kvmclock operation:
> +
> + * MSR_KVM_WALL_CLOCK, value 0x4b564d00 and
> + * MSR_KVM_SYSTEM_TIME, value 0x4b564d01.
> +
> +The first one is only used in rare situations, like boot-time and a
> +suspend-resume cycle. Data is disposable, and after used, the guest
> +may use it for something else. This is hardly a hot path for anything.
> +The Hypervisor fills in the address provided through this MSR with the
> +following structure:
> +
> +struct pvclock_wall_clock {
> + u32 version;
> + u32 sec;
> + u32 nsec;
> +} __attribute__((__packed__));
> +
> +Guest should only trust data to be valid when version haven't changed before

has not

> +and after reads of sec and nsec. Besides not changing, it has to be an even
> +number. Hypervisor may write an odd number to version field to indicate that
> +an update is in progress.
> +
> +MSR_KVM_SYSTEM_TIME, on the other hand, has persistent data, and is
> +constantly updated by the hypervisor with time information. The data
> +written in this MSR contains two pieces of information: the address in which
> +the guests expects time data to be present 4-byte aligned or'ed with an
> +enabled bit. If one wants to shutdown kvmclock, it just needs to write
> +anything that has 0 as its last bit.
> +
> +Time information presented by the hypervisor follows the structure:
> +
> +struct pvclock_vcpu_time_info {
> + u32 version;
> + u32 pad0;
> + u64 tsc_timestamp;
> + u64 system_time;
> + u32 tsc_to_system_mul;
> + s8 tsc_shift;
> + u8 pad[3];
> +} __attribute__((__packed__));
> +
> +The version field plays the same role as with the one in struct
> +pvclock_wall_clock. The other fields, are:
> +
> + a. tsc_timestamp: the guest-visible tsc (result of rdtsc + tsc_offset) of
> + this cpu at the moment we recorded system_time. Note that some time is

CPU (please)

> + inevitably spent between system_time and tsc_timestamp measurements.
> + Guests can subtract this quantity from the current value of tsc to obtain
> + a delta to be added to system_time

to system_time.

> +
> + b. system_time: this is the most recent host-time we could be provided with.
> + host gets it through ktime_get_ts, using whichever clocksource is
> + registered at the moment

moment.

> +
> + c. tsc_to_system_mul: this is the number that tsc delta has to be multiplied
> + by in order to obtain time in nanoseconds. Hypervisor is free to change
> + this value in face of events like cpu frequency change, pcpu migration,

CPU

> + etc.
> +
> + d. tsc_shift: guests must shift

missing text??

> +
> +With this information available, guest calculates current time as:
> +
> + T = kt + to_nsec(tsc - tsc_0)
> +
> +2.3 Compatibility MSRs
> +----------------------
> +
> +Guests running on top of older hypervisors may have to use a different set of
> +MSRs. This is because originally, kvmclock MSRs were exported within a
> +reserved range by accident. Guests should check cpuid leaf 0x40000001 for the
> +presence of kvmclock. If bit 3 is disabled, but bit 0 is enabled, guests can
> +have access to kvmclock functionality through
> +
> + * MSR_KVM_WALL_CLOCK_OLD, value 0x11 and
> + * MSR_KVM_SYSTEM_TIME_OLD, value 0x12.
> +
> +Note, however, that this is deprecated.
> +
> +3. Migration
> +============
> +
> +Two ioctls are provided to aid the task of migration:
> +
> + * KVM_GET_CLOCK and
> + * KVM_SET_CLOCK
> +
> +Their aim is to control an offset that can be summed to system_time, in order
> +to guarantee monotonicity on the time over guest migration. Source host
> +executes KVM_GET_CLOCK, obtaining the last valid timestamp in this host, while
> +destination sets it with KVM_SET_CLOCK. It's the destination responsibility to
> +never return time that is less than that.


---
~Randy
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/