On Wed, 16 Jul 2014 08:41:00 -0300
Marcelo Tosatti <mtosatti@xxxxxxxxxx> wrote:
On Wed, Jul 16, 2014 at 12:18:37PM +0200, Paolo Bonzini wrote:How is it less reliable than last_value logic?
Il 16/07/2014 11:52, Igor Mammedov ha scritto:
There are buggy hosts in the wild that advertise invariant
TSC and as result host uses TSC as clocksource, but TSC on
such host sometimes sporadically jumps backwards.
This causes kvmclock to go backwards if host advertises
PVCLOCK_TSC_STABLE_BIT, which turns off aggregated clock
accumulator and returns:
pvclock_vcpu_time_info.system_timestamp + offset
where 'offset' is calculated using TSC.
Since TSC is not virtualized in KVM, it makes guest see
TSC jumped backwards and leads to kvmclock going backwards
as well.
This is defensive patch that keeps per CPU last clock value
and ensures that clock will never go backwards even with
using PVCLOCK_TSC_STABLE_BIT enabled path.
I'm not sure that a per-CPU value is enough; your patch can make the
problem much less frequent of course, but I'm not sure neither
detection nor correction are 100% reliable. Your addition is
basically a faster but less reliable version of the last_value
logic.
that might be an option, but what value we need to store intoIf may be okay to have detection that is faster but not 100%
reliable. However, once you find that the host is buggy I think the
correct thing to do is to write last_value and kill
PVCLOCK_TSC_STABLE_BIT from valid_flags.
last_value?
To make sure that clock won't go back we need to track
it on all CPUs and store highest value to last_value, at this point
there is no point in switching to last_value path since we have to
track per CPU anyway.
Can we move detection to the host TSC clocksource driver ?
I haven't looked much at host side solution yet,
but to detection reliable it needs to be run constantly,
from read_native_tsc().
it's possible to put detection into check_system_tsc_reliable() but
that would increase boot time and it's not clear for how long test
should run to make detection reliable (in my case it takes ~5-10sec
to detect first failure).
Best we could at boot time is mark TSC as unstable on affected hardware,
but for this we need to figure out if it's specific machine or CPU issue
to do it properly. (I'm in process of finding out who to bug with it)
PS: it appears that host runs stably.
but kvm_get_time_and_clockread() is affected since it uses its own
version of do_monotonic()->vgettsc() which will lead to cycles
go backwards and overflow of nano secs in timespec. We should mimic
vread_tsc() here so not to run into this kind of issues.