Re: [RFC PATCH 0/2] kvmclock: fix ABI breakage from PVCLOCK_COUNTS_FROM_ZERO.

From: Radim KrÄmÃÅ
Date: Mon Sep 21 2015 - 16:00:36 EST


2015-09-21 12:52-0300, Marcelo Tosatti:
> On Mon, Sep 21, 2015 at 05:12:10PM +0200, Radim KrÄmÃÅ wrote:
>> 2015-09-20 19:57-0300, Marcelo Tosatti:
>>> Is it counting from zero that breaks SLES10?
>>
>> Not by itself, treating MSR_KVM_SYSTEM_TIME as one-shot initializer did.
>> The guest wants to write to MSR_KVM_SYSTEM_TIME as much as it likes to,
>> while still keeping system time; we used to allow that, which means an
>> ABI breakage. (And we can't even say that guest's behaviour is against
>> the spec ...)
>
> Because this behaviour was not defined.

It is defined by implementation.

> Can't you just condition PVCLOCK_COUNTS_FROM_ZERO behaviour on
> boot_vcpu_runs_old_kvmclock == false?
> The patch would be much simpler.


If you mean the hunk in cover letter, I don't like it because we presume
that no other guests were broken.

I really don't like it so I thought about other problems with
PVCLOCK_COUNTS_FROM_ZERO ... have you tried to hot-replug VCPU 0?
It doesn't work well ;)

We don't want to guess what the guest wants so I'd go for the opt-in
paravirt feature unless counting from zero can be done in guest alone.

> The problem is, "selecting one read as the initial point" is inherently
> racy: that delta is relative to one moment (kvmclock read) at one vcpu,
> but must be applied to all vcpus.

I don't think that is a problem.

Kvmclock has a notion of a global system_time in nanoseconds (one value
that defines the time, assigned with VM ioctl KVM_SET_CLOCK) and tries
to propagate system_time into guest VCPUs as precisely as possible with
the help of TSC.

sched_clock uses kvmclock to get nanoseconds since the system was
brought up and [1/2] only works with this abstracted ns count.
A poorly synchronized initial read is irrelevant because all VCPUs will
be using the same constant offset.
(We can never know the precise time anyway.)

> Besides:
>
> 1) Stable sched clock in guest does not depend on
> KVM_FEATURE_CLOCKSOURCE_STABLE_BIT.

Yes, thanks, I will remove that requirement in v1. (We'd need to
improve a loss of PVCLOCK_TSC_STABLE_BIT otherwise anyway.)

Because the clutchy dependency on PVCLOCK_TSC_STABLE_BIT is going away,
there is now one possible unsigned overflow: in case the clock was very
imprecise and VCPU1 managed to get smaller system_time than VCPU0 at the
time of initialization. It's very unlikely that we'll ever reach legal
overflow so I can add a condition there.

> 2) You rely on monotonicity across vcpus to perform
> the 'minus delta that was read on vcpu0' calculation, but
> monotonicity across vcpus can fail during runtime
> (say host clocksource goes tsc->hpet due to tsc instability).

That could be a problem, but I'm adding a VCPU independent constant to
all reads -- does the new code rely on monoticity in places where the
old one didn't?
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/