Re: [PATCH v2 1/3] KVM: x86: implement KVM_{GET|SET}_TSC_STATE

From: Maxim Levitsky
Date: Mon Dec 07 2020 - 07:18:27 EST


On Sun, 2020-12-06 at 17:19 +0100, Thomas Gleixner wrote:
> On Thu, Dec 03 2020 at 19:11, Maxim Levitsky wrote:
> > + case KVM_SET_TSC_STATE: {
> > + struct kvm_tsc_state __user *user_tsc_state = argp;
> > + struct kvm_tsc_state tsc_state;
> > + u64 host_tsc, wall_nsec;
> > +
> > + u64 new_guest_tsc, new_guest_tsc_offset;
> > +
> > + r = -EFAULT;
> > + if (copy_from_user(&tsc_state, user_tsc_state, sizeof(tsc_state)))
> > + goto out;
> > +
> > + kvm_get_walltime(&wall_nsec, &host_tsc);
> > + new_guest_tsc = tsc_state.tsc;
> > +
> > + if (tsc_state.flags & KVM_TSC_STATE_TIMESTAMP_VALID) {
> > + s64 diff = wall_nsec - tsc_state.nsec;
> > + if (diff >= 0)
> > + new_guest_tsc += nsec_to_cycles(vcpu, diff);
> > + else
> > + new_guest_tsc -= nsec_to_cycles(vcpu, -diff);
> > + }
> > +
> > + new_guest_tsc_offset = new_guest_tsc - kvm_scale_tsc(vcpu, host_tsc);
> > + kvm_vcpu_write_tsc_offset(vcpu, new_guest_tsc_offset);
>
> From a timekeeping POV and the guests expectation of TSC this is
> fundamentally wrong:
>
> tscguest = scaled(hosttsc) + offset
>
> The TSC has to be viewed systemwide and not per CPU. It's systemwide
> used for timekeeping and for that to work it has to be synchronized.
>
> Why would this be different on virt? Just because it's virt or what?
>
> Migration is a guest wide thing and you're not migrating single vCPUs.
>
> This hackery just papers over he underlying design fail that KVM looks
> at the TSC per vCPU which is the root cause and that needs to be fixed.

I don't disagree with you.
As far as I know the main reasons that kvm tracks TSC per guest are

1. cases when host tsc is not stable
(hopefully rare now, and I don't mind making
the new API just refuse to work when this is detected, and revert to old way
of doing things).

2. (theoretical) ability of the guest to introduce per core tsc offfset
by either using TSC_ADJUST (for which I got recently an idea to stop
advertising this feature to the guest), or writing TSC directly which
is allowed by Intel's PRM:

"The RDMSR and WRMSR instructions read and write the time-stamp counter, treating the time-stamp counter as an
ordinary MSR (address 10H). In the Pentium 4, Intel Xeon, and P6 family processors, all 64-bits of the time-stamp
counter are read using RDMSR (just as with RDTSC). When WRMSR is used to write the time-stamp counter on
processors before family [0FH], models [03H, 04H]: only the low-order 32-bits of the time-stamp counter can be
written (the high-order 32 bits are cleared to 0). For family [0FH], models [03H, 04H, 06H]; for family [06H]],
model [0EH, 0FH]; for family [06H]], DisplayModel [17H, 1AH, 1CH, 1DH]: all 64 bits are writable."


But other than that I don't mind making TSC offset global per VM thing.
Paulo, what do you think about this?

Best regards,
Maxim Levitsky


>
> Thanks,
>
> tglx
>