Everything you want to know about time (Was: Cyrix 6x86MX and Centaur C6 CPUs in 2.1.102)

C. Scott Ananian (cananian@lcs.mit.edu)
Wed, 20 May 1998 20:08:08 -0400 (EDT)


My apologies for the length.

On Wed, 20 May 1998 André Derrick Balsa <andrebalsa@altern.org> wrote:

> e) Another thing that I would like to see was recently suggested by
> Vojtech Pavlik :
>
> Could we have a single calibration of the TSC done at boot time? For
> some reason I cannot understand, the TSC gets re-calibrated 100
> times/second in time.c !?!?! If there is a real need for this frequent
> recalibration, perhaps we could have the value exported so other CPU
> drivers can make use of it?

Well, the above is not *precisely* correct.

The calibration is done using an *average* (because jiffy-to-jiffy
interrupt latency may be variable? Just a guess). As time goes on, the
'frequent recalibration' actually becomes more accurate. Or is supposed
to. More on that later. To quote the kernel source for the recalibration
(linux/arch/i386/kernel/time.c):
/*
 * Divide the 64-bit time with the 32-bit jiffy counter,
 * getting the quotient in clocks.
 *
 * Giving quotient = "1/(average internal clocks per usec)*2^32"
 * we do this '1/...' trick to get the 'mull' into the critical
 * path. 'mull' is much faster than divl (10 vs. 41 clocks)
 */
Note this is the 64-bit time in cycles since init (the TSC value read by rdtsc
at init is subtracted from the value at the last jiffy interrupt), and the
jiffy counter counts timer ticks since init. So the "average internal clocks
per usec" is a running average since the machine first booted. [This also
eliminates the jitter that may happen at init.] This is done in
do_fast_gettimeoffset, so if nobody's calling gettimeofday, it will not be
recalculated. So it's 100 times a second *max* (worst case once per jiffy,
the first time gettimeofday is called in each jiffy).

*However*, there are many cases where it is not reasonable to assume that
"average internal clocks per usec" is a constant. Haltable CPUs are one
example, as are APM-enabled machines that slow the processor clock to save
power. And let's not forget those ol' machines with the 'turbo' button on
the front. Perhaps an average over the last N jiffies is more
appropriate....

...but it doesn't really matter, because the worst that will happen if
your clock speed changes (or even stops between interrupts) is that the
intra-jiffy timing will be off: gettimeofday will estimate the wrong time
based on the TSC count.

THE EVIL OOPS POTENTIAL: (this is what really matters)
~~~~~~~~~~~~~~~~~~~~~~~~
BUT some machines do the unthinkable -- they actually randomly destroy the
TSC value during "power-saving." This is the Cyrix bug. This also occurs
during APM suspend: only the low 32 bits of the TSC can be restored after
you power off the processor; the high 32 bits are zeroed. [The Centaur
shouldn't Oops, as it doesn't destroy the TSC, it just stops it. This
leads to a more subtle (but non-catastrophic) problem, discussed later.]
It took me some thinking to figure out exactly how destroying the TSC
causes the oops, so I'll recap the process for you. (It seems obvious in
retrospect; the hard part was decoding the asm. Brainy people can read
linux/arch/i386/kernel/time.c, understand it immediately, and then skip
ahead.)

First of all, the calculation that do_fast_gettimeoffset performs is:
(rewritten to make the divisions obvious and to eliminate inline assembly)

	eax = (last_timer - init_timer) / jiffies;
	tmp = eax;
	quotient = (USECS_PER_JIFFY * (2^32)) / tmp;

where last_timer is the TSC value at the last timer interrupt/jiffy,
init_timer is the TSC value at init (something close to zero; it's
initialized in time_init(), which is called from init/main.c), and jiffies
is the total number of timer interrupts/jiffies since boot
(interrupts are enabled right before time_init() is called).
USECS_PER_JIFFY is what it says it is, calculated from what we know about the
timer frequency.
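
The intra-jiffy offset is then obtained by multiplying the cycles elapsed
since the last timer tick by 'quotient' and keeping the high 32 bits of the
product. Roughly (a C sketch of the arithmetic only, not the kernel's actual
inline asm, and ignoring the cached_quotient shortcut):

/* sketch only: how 'quotient' turns cycles-since-last-tick into usecs */
unsigned long usecs_since_last_tick(unsigned long long tsc_now,
				    unsigned long long last_timer,
				    unsigned long quotient)
{
	unsigned long delta = (unsigned long) (tsc_now - last_timer);

	/* quotient is "usecs per clock, scaled by 2^32", so the high
	 * 32 bits of the product are the microseconds within this jiffy */
	return (unsigned long) (((unsigned long long) delta * quotient) >> 32);
}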

So, the danger is that when the TSC is trashed (high 32 bits are zeros),
last_timer - init_timer suddenly becomes very small (even negative!), and
(last_timer - init_timer) / jiffies;
is likely to be zero. When it *is* zero, calculating 'quotient':
( USECS_PER_JIFFY * (2^32) ) / ((last_timer-init_timer)/jiffies);
will result in a divide-by-zero error: a kernel Oops.
[As these are unsigned divides, we have *real fun* when (last_timer -
init_timer) is negative.]
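
To make the failure concrete, here's a little user-space demonstration of
that arithmetic. The numbers are invented (think a ~200MHz machine, HZ=100,
up for five minutes); the point is just to show how the per-jiffy average
collapses to zero (or wraps to something enormous) once only the low 32 bits
of the TSC survive:

#include <stdio.h>

int main(void)
{
	/* made-up but plausible values */
	unsigned long long init_timer = 500000ULL;	/* TSC reading near boot */
	unsigned long long jiffies    = 30000ULL;	/* 5 minutes at HZ=100 */

	/* healthy 64-bit TSC after 5 minutes at ~200MHz */
	unsigned long long good = 60000000000ULL;

	/* trashed TSC: only the low 32 bits survive, so last_timer is
	 * whatever the low word happened to be -- here, just past
	 * init_timer (average goes to zero) and just before it (wraps) */
	unsigned long long trashed_lo = init_timer + 10000ULL;
	unsigned long long trashed_hi = init_timer - 10000ULL;

	printf("healthy average: %llu clocks/jiffy\n",
	       (good - init_timer) / jiffies);
	printf("trashed average: %llu clocks/jiffy (divide by this -> Oops)\n",
	       (trashed_lo - init_timer) / jiffies);
	printf("wrapped average: %llu clocks/jiffy (quotient goes tiny)\n",
	       (trashed_hi - init_timer) / jiffies);
	return 0;
}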

So, there are two issues here. First off, we should definitely not use
this routine if the TSC is likely to be trashed for any reason: we'll get
bad time values, and the occasional Oops. (Note that only intra-jiffy
times are affected; read the source for the details on how the error is
bounded). ['quotient' for a trashed TSC will either be very large or very
small, leading to intra-jiffy times that either race ahead or lag behind]

Second, is a cumulative average really the way to go here? For
fixed-clock machines, undoubtedly. But it may be worthwhile using
slow_gettimeoffset if your clock speed can change for any reason: turbo
button, APM, or any chip that halts the TSC at any time (to save power, or
whatever). Otherwise you can just kiss the microsecond timing of your
favorite OS good-bye.

We certainly don't want to replace this with a single-shot calibration:
too many variable latencies. We want *some* sort of average. What's the
right time scale?
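
Something like the following is what I have in mind for an average over the
last N jiffies (just a sketch: N, the ring buffer, and the function names are
all invented here, and the real thing would live in the timer interrupt path):

/* Sketch: keep the TSC values of the last N timer interrupts in a ring
 * buffer, and derive "clocks per jiffy" from the oldest and newest
 * entries instead of from boot time. All names are invented. */
#define N_JIFFY_WINDOW 64

static unsigned long long tsc_ring[N_JIFFY_WINDOW];
static unsigned long ring_head;		/* index of the newest sample */
static unsigned long ring_count;	/* how many samples we have so far */

/* called once per timer interrupt, with the TSC value just read */
void record_jiffy_tsc(unsigned long long tsc_now)
{
	ring_head = (ring_head + 1) % N_JIFFY_WINDOW;
	tsc_ring[ring_head] = tsc_now;
	if (ring_count < N_JIFFY_WINDOW)
		ring_count++;
}

/* average clocks per jiffy over the window (0 if we can't tell yet) */
unsigned long long clocks_per_jiffy_windowed(void)
{
	unsigned long oldest;

	if (ring_count < 2)
		return 0;
	oldest = (ring_head + N_JIFFY_WINDOW - (ring_count - 1)) % N_JIFFY_WINDOW;
	return (tsc_ring[ring_head] - tsc_ring[oldest]) / (ring_count - 1);
}

A window like that would recover within N ticks of an APM slowdown or a
turbo-button press, at the cost of a little bookkeeping per interrupt.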

Lastly, exporting the real CPU cycle time to /proc/cpuinfo. Not a
difficult thing: we've got init_timer_cc, last_timer_cc, and jiffies in
memory -- not to mention cached_quotient sitting around (a sketch of the
arithmetic is below). The catch is that this information is probably not
available on all architectures, and exporting the relevant data in an
architecture-independent manner will probably be hairy. Not impossible, but
it will have to be done carefully and cleanly if it's going to make it past
Linus. Remember that all the reasons for not using do_fast_gettimeoffset are
also reasons why measuring your CPU clock rate this way isn't going to work.
Probably best to stick with BogoMIPS. The attached user-mode program will
calculate the 'real' CPU clock speed for you in a much nicer (read,
non-kernel) manner.
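
Here's roughly what that kernel-side arithmetic would look like (a sketch
only: assuming HZ=100, ignoring locking and all the caveats above, and taking
the kernel's globals as parameters purely for illustration):

/* estimate the CPU clock in kHz from the values time.c already keeps */
unsigned long cpu_khz_estimate(unsigned long long last_timer_cc,
			       unsigned long long init_timer_cc,
			       unsigned long jiffies)
{
	unsigned long long cycles_per_jiffy;

	if (jiffies == 0)
		return 0;
	cycles_per_jiffy = (last_timer_cc - init_timer_cc) / jiffies;
	/* kHz = cycles/jiffy * HZ(=100) / 1000 = cycles/jiffy / 10 */
	return (unsigned long) (cycles_per_jiffy / 10);
}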
--Scott
@ @
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-oOO-(_)-OOo-=-=-=-=-=
C. Scott Ananian: cananian@lcs.mit.edu / Declare the Truth boldly and
Laboratory for Computer Science/Crypto / without hindrance.
Massachusetts Institute of Technology /META-PARRESIAS AKOLUTOS:Acts 28:31
-.-. .-.. .. ..-. ..-. --- .-. -.. ... -.-. --- - - .- -. .- -. .. .- -.
PGP key available via finger and from http://www.pdos.lcs.mit.edu/~cananian

---------------- cut here -----------------------------
/* Contributed by an anonymous internet hero */
#include <stdio.h>
#include <strings.h>	/* bzero() */
#include <sys/time.h>
#include <unistd.h>	/* sleep() */

/* returns number of clock cycles since last reboot */
__inline__ unsigned long long int rdtsc(void)
{
	unsigned long long int x;

	/* 0x0f 0x31 is the RDTSC opcode, spelled as bytes so old assemblers cope */
	__asm__ volatile (".byte 0x0f, 0x31" : "=A" (x));
	return x;
}

int main(void)
{
	struct timezone tz;
	struct timeval tvstart, tvstop;
	unsigned long long int cycles[2];	/* TSC samples; gotta be 64 bit */
	unsigned int microseconds;		/* total wall-clock time taken */

	bzero(&tz, sizeof(tz));

	/* get this function in cached memory, then take the real start samples */
	gettimeofday(&tvstart, &tz);
	cycles[0] = rdtsc();
	gettimeofday(&tvstart, &tz);

	/* we don't trust that this is any specific length of time */
	sleep(1);

	cycles[1] = rdtsc();
	gettimeofday(&tvstop, &tz);
	microseconds = ((tvstop.tv_sec - tvstart.tv_sec) * 1000000) +
		       (tvstop.tv_usec - tvstart.tv_usec);

	/* cycles elapsed divided by microseconds elapsed == MHz */
	printf("%f MHz processor.\n",
	       (float) (cycles[1] - cycles[0]) / microseconds);

	return 0;
}
