Re: [PATCHv1] kvm guest: fix uninitialized kvmclock read by KVM guest

From: Marcelo Tosatti
Date: Tue Jun 18 2013 - 18:22:25 EST


On Sat, Jun 15, 2013 at 10:01:45PM +0400, Eugene Batalov wrote:
> Due to unintialized kvmclock read KVM guest is hanging on SMP boot stage.
> If unintialized memory contains fatal garbage then hang reproduction is 100%.
> Unintialized memory is allocated by memblock_alloc. So the garbage values
> depend on many many things.
>
> See the detailed description of the bug and possible ways to fix it
> in the kernel bug tracker.
> "Bug 59521 - KVM linux guest reads uninitialized pvclock values before
> executing rdmsr MSR_KVM_WALL_CLOCK"
>
> I decided to fix it simply returning 0ULL from kvm_clock_read when
> kvm clocksource is not initialized yet.
> The same as kernel bootstrap CPU doesn on boot stage when kernel
> clocksources are not initialized yet.
>
> Signed-off-by: Eugene Batalov <ebatalov@xxxxxxxxxxxxx>
> ---
> I dont' use kernel percpu variables because for each SMP CPU
> their contents are copied from the bootstrap CPU. And I don't
> think that fixing the value for each SMP CPU is a good style.
> If you know a better approach to store the is_pv_clock_ready flags
> I am ready to use it.
>
> The patch applies cleanly to
> git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
>
> I've tested the changes with Ubuntu 13.04 "raring" userspace and
> Ubuntu-3.8.0.19-30 kernel tag.
>
> arch/x86/kernel/kvmclock.c | 13 +++++++++++--
> 1 file changed, 11 insertions(+), 2 deletions(-)
>
> diff --git a/arch/x86/kernel/kvmclock.c b/arch/x86/kernel/kvmclock.c
> index 5bedbdd..a6e0af4 100644
> --- a/arch/x86/kernel/kvmclock.c
> +++ b/arch/x86/kernel/kvmclock.c
> @@ -43,6 +43,9 @@ early_param("no-kvmclock", parse_no_kvmclock);
> static struct pvclock_vsyscall_time_info *hv_clock;
> static struct pvclock_wall_clock wall_clock;
>
> +/* For each cpu store here a flag which tells whether pvclock is initialized */
> +static int __cacheline_aligned_in_smp is_pv_clock_ready[NR_CPUS] = {};
> +
> /*
> * The wallclock is the time of day when we booted. Since then, some time may
> * have elapsed since the hypervisor wrote the data. So we try to account for
> @@ -84,8 +87,11 @@ static cycle_t kvm_clock_read(void)
>
> preempt_disable_notrace();
> cpu = smp_processor_id();
> - src = &hv_clock[cpu].pvti;
> - ret = pvclock_clocksource_read(src);
> + if (is_pv_clock_ready[cpu]) {
> + src = &hv_clock[cpu].pvti;
> + ret = pvclock_clocksource_read(src);
> + } else
> + ret = 0ULL;
> preempt_enable_notrace();
> return ret;
> }
> @@ -168,6 +174,9 @@ int kvm_register_clock(char *txt)
> printk(KERN_INFO "kvm-clock: cpu %d, msr %x:%x, %s\n",
> cpu, high, low, txt);
>
> + if (!ret)
> + is_pv_clock_ready[cpu] = 1;
> +
> return ret;
> }

The path can be really hot, so better avoid it if possible. The patch
to zero the memblock_alloc'd area from Igor brings the behaviour back
to before regression: return 0 until kvmclock is initialized. On top of
your analysis in the BZ, it now seems the right (and safer) thing to do.


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/