Re: Kernel hangs in SMP + VMware environment.

From: Alok Kataria
Date: Wed May 14 2008 - 14:31:19 EST


On Wed, May 14, 2008 at 4:00 AM, Tetsuo Handa
<penguin-kernel@xxxxxxxxxxxxxxxxxxx> wrote:
> Hello.
>
> Roland wrote:
>> maybe related to http://bugzilla.kernel.org/show_bug.cgi?id=9834 ?
> Thank you for URL.
> My bug seems to be timer related.
>
>> you say "recent" , so this does happen from 2.6.21 to 2.6.26rc2 ?
> I don't know exact version, but I don't experience this problem
> in earlier kernels (e.g. Fedora Core 5 which uses kernel 2.6.20).
>
>> does that happen only on a dedicated vmware box, or on different ones?
>> vmware-tools active? ->stop -> different ?
> vmware-tools is not installed for my Fedora 8.
>
>> could you provide some more information about your hardware/vmware
>> environment ?
>> does that happen on esx or on hosted products (workstation, server,
>> player..) ?
>
> Hardware: ThinkPad X60 (Intel Core 2 Duo, 2048MB RAM, No swap partition)
> VMware host environment: CentOS 5.1 (x86_64)
> VMware version: VMware Workstation 6.0.2 (x86_64)
> VMware guest environment: many distro using recent kernels (all i386)

Hi Tetsuo,

Can you try the patch attached to this mail? I made it on top of
2.6.24.7, but it should apply to any other 2.6.24-based distro kernel.

If you still see the same problem with the attached patch, please send
me your config file and the boot-time dmesg output.

Thanks,
Alok

This is a backport of the patch submitted by Thomas Gleixner to the
x86 git tree, commit d8bb6f4c1670c8324e4135c61ef07486f7f17379

Comments from the original post:

We already catch most of the TSC problems by sanity checks, but there
is a subtle bug which has been in the code forever. This can cause
time jumps in the range of hours.

This was reported in:
http://lkml.org/lkml/2007/8/23/96
and
http://lkml.org/lkml/2008/3/31/23

I was able to reproduce the problem with a gettimeofday loop test on a
dual core and a quad core machine which both have synchronized
TSCs. The TSCs seem not to be perfectly in sync though, but the
kernel is not able to detect the slight delta in the sync check. Still,
there exists an extremely small window where this delta can be observed
as a very large time jump. So far I was only able to reproduce this
with the vsyscall gettimeofday implementation, but in theory this
might be observable with the syscall based version as well.

CPU0 updates the clock source variables under the xtime/vsyscall lock
and CPU1, where the TSC is slightly behind CPU0, is reading the time
right after the seqlock was unlocked.

The clocksource reference data was updated with the TSC from CPU0 and
the value which is read from the TSC on CPU1 is less than the reference
data. This results in a huge delta value due to the unsigned
subtraction of the reference value from the TSC value. This algorithm
cannot be changed because it is needed to support wrapping clock
sources like the pm timer.

The huge delta is converted to nanoseconds and added to xtime, which
is then observable by the caller. The next gettimeofday call on CPU1
will show the correct time again as now the TSC has advanced above the
reference value.
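
The effect of the unsigned subtraction can be seen in a small user-space
sketch. This is not the kernel code: the cycle_t typedef and the helper
name cycle_delta() are illustrative assumptions, and the clocksource mask
is left out for brevity.

```c
#include <stdint.h>

/* Assumption: a 64-bit unsigned stand-in for the kernel's cycle_t. */
typedef uint64_t cycle_t;

/*
 * Simplified unsigned delta between a fresh counter reading and the
 * stored reference value. The subtraction is modulo 2^64, so a reading
 * that is slightly below the reference wraps around to a huge value.
 */
cycle_t cycle_delta(cycle_t now, cycle_t cycle_last)
{
	return now - cycle_last;
}
```

With a reference of 1000000 and a reading of 999990 (ten cycles behind),
cycle_delta() returns 2^64 - 10 rather than a small negative number.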

To prevent this TSC specific wreckage we need to compare the TSC value
against the reference value and return the latter when it is larger
than the actual TSC value.
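
As a rough user-space illustration of that comparison (a sketch only; the
helper names clamp_to_reference() and clamped_delta() are hypothetical and
do not appear in the patch):

```c
#include <stdint.h>

/* Assumption: a 64-bit unsigned stand-in for the kernel's cycle_t. */
typedef uint64_t cycle_t;

/*
 * Return the raw reading unless it is behind the reference value;
 * in that case return the reference value itself.
 */
cycle_t clamp_to_reference(cycle_t tsc, cycle_t cycle_last)
{
	return tsc >= cycle_last ? tsc : cycle_last;
}

/*
 * Unsigned delta computed on the clamped reading. A reading that is
 * behind the reference now yields 0 instead of wrapping around.
 */
cycle_t clamped_delta(cycle_t tsc, cycle_t cycle_last)
{
	return clamp_to_reference(tsc, cycle_last) - cycle_last;
}
```

A reading that is ten cycles behind the reference gives a delta of 0, so
the caller sees time stand still for a moment rather than jump forward.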

I considered marking the TSC unstable when the readout is smaller than
the reference value, but this would render an otherwise good and fast
clocksource unusable without a really good reason.

Signed-off-by: Alok Kataria <akataria@xxxxxxxxxx>
CC: Thomas Gleixner <tglx@xxxxxxxxxxxxx>

Index: linux-2.6.24.7/arch/x86/kernel/tsc_32.c
===================================================================
--- linux-2.6.24.7.orig/arch/x86/kernel/tsc_32.c 2008-05-14 11:25:20.000000000 -0700
+++ linux-2.6.24.7/arch/x86/kernel/tsc_32.c 2008-05-14 11:26:21.000000000 -0700
@@ -268,14 +268,27 @@
/* clock source code */

static unsigned long current_tsc_khz = 0;
+static struct clocksource clocksource_tsc;

+/*
+ * We compare the TSC to the cycle_last value in the clocksource
+ * structure to avoid a nasty time-warp issue. This can be observed in
+ * a very small window right after one CPU updated cycle_last under
+ * xtime lock and the other CPU reads a TSC value which is smaller
+ * than the cycle_last reference value due to a TSC which is slightly
+ * behind. This delta is nowhere else observable, but in that case it
+ * results in a forward time jump in the range of hours due to the
+ * unsigned delta calculation of the time keeping core code, which is
+ * necessary to support wrapping clocksources like pm timer.
+ */
static cycle_t read_tsc(void)
{
cycle_t ret;

rdtscll(ret);

- return ret;
+ return ret >= clocksource_tsc.cycle_last ?
+ ret : clocksource_tsc.cycle_last;
}

static struct clocksource clocksource_tsc = {
Index: linux-2.6.24.7/arch/x86/kernel/tsc_64.c
===================================================================
--- linux-2.6.24.7.orig/arch/x86/kernel/tsc_64.c 2008-05-14 11:25:20.000000000 -0700
+++ linux-2.6.24.7/arch/x86/kernel/tsc_64.c 2008-05-14 11:26:21.000000000 -0700
@@ -10,6 +10,7 @@

#include <asm/hpet.h>
#include <asm/timex.h>
+#include <asm/vgtod.h>

static int notsc __initdata = 0;

@@ -246,18 +247,34 @@

__setup("notsc", notsc_setup);

+static struct clocksource clocksource_tsc;

-/* clock source code: */
+/*
+ * We compare the TSC to the cycle_last value in the clocksource
+ * structure to avoid a nasty time-warp. This can be observed in a
+ * very small window right after one CPU updated cycle_last under
+ * xtime/vsyscall_gtod lock and the other CPU reads a TSC value which
+ * is smaller than the cycle_last reference value due to a TSC which
+ * is slightly behind. This delta is nowhere else observable, but in
+ * that case it results in a forward time jump in the range of hours
+ * due to the unsigned delta calculation of the time keeping core
+ * code, which is necessary to support wrapping clocksources like pm
+ * timer.
+ */
static cycle_t read_tsc(void)
{
cycle_t ret = (cycle_t)get_cycles_sync();
- return ret;
+
+ return ret >= clocksource_tsc.cycle_last ?
+ ret : clocksource_tsc.cycle_last;
}

static cycle_t __vsyscall_fn vread_tsc(void)
{
cycle_t ret = (cycle_t)get_cycles_sync();
- return ret;
+
+ return ret >= __vsyscall_gtod_data.clock.cycle_last ?
+ ret : __vsyscall_gtod_data.clock.cycle_last;
}

static struct clocksource clocksource_tsc = {
Index: linux-2.6.24.7/kernel/time/timekeeping.c
===================================================================
--- linux-2.6.24.7.orig/kernel/time/timekeeping.c 2008-05-14 11:25:20.000000000 -0700
+++ linux-2.6.24.7/kernel/time/timekeeping.c 2008-05-14 11:26:21.000000000 -0700
@@ -189,6 +189,7 @@
if (clock == new)
return;

+ new->cycle_last = 0;
now = clocksource_read(new);
nsec = __get_nsec_offset();
timespec_add_ns(&xtime, nsec);
@@ -301,6 +302,7 @@
/* Make sure that we have the correct xtime reference */
timespec_add_ns(&xtime, timekeeping_suspend_nsecs);
/* re-base the last cycle value */
+ clock->cycle_last = 0;
clock->cycle_last = clocksource_read(clock);
clock->error = 0;
timekeeping_suspended = 0;