[PATCHv2] timekeeping: Fix clock stability with nohz

From: Miroslav Lichvar
Date: Thu Nov 14 2013 - 05:06:02 EST


Since commit 5eb6d205, NTP corrections are accumulated independently
from the system clock and the clock multiplier is adjusted in a feedback
loop according to the current error between the system time and NTP
time.

This works well when the multiplier is updated often and regularly. With
nohz and idle system, however, the update interval is so long that the
loop becomes unstable. The frequency of the clock doesn't settle down
and there is a large time error, which seems to grow quadratically with
update interval.

If the constants used in the loop were modified for a maximum update
interval, it would be stable, but too slow to keep up with NTP. Without
knowing when will be the next update it's difficult to implement a loop
that is both fast and stable.

This patch fixes the problem by postponing update of NTP tick length in
the clock and setting the multiplier directly without feedback loop by
dividing the tick length with clock cycles. Previously, a change in tick
length was applied immediately to all ticks since last update, which
caused a jump in the NTP error proportional to the change and the update
interval and which had to be corrected later by the loop.

The only source of the accounted NTP error is now lack of resolution in
the clock multiplier. The error is corrected by adding 0 or 1 to the
calculated multiplier. With a constant tick length, the multiplier will
be switching between the two values. The loop is greatly simplified and
there is no problem with stability. The maximum error is limited by the
update interval.

In a simulation with 1GHz TSC clock and 10Hz clock update the maximum
error went down from 4.7 microseconds to 5.5 nanoseconds. With 1Hz
update the maximum error went down from 480 microseconds to 55
nanoseconds.

In a real test on idle machine comparing raw TSC and clock_gettime()
time stamps, the maximum error went down from microseconds to tens of
nanoseconds. A test with clock synchronized to a PTP hardware clock by
phc2sys from linuxptp now shows no difference when running with nohz
enabled and disabled, the clock seems to be stable to few tens of
nanoseconds.

Signed-off-by: Miroslav Lichvar <mlichvar@xxxxxxxxxx>
---
include/linux/timekeeper_internal.h | 6 ++
kernel/time/timekeeping.c | 156 ++++++++++++------------------------
2 files changed, 58 insertions(+), 104 deletions(-)

diff --git a/include/linux/timekeeper_internal.h b/include/linux/timekeeper_internal.h
index c1825eb..d55a12a 100644
--- a/include/linux/timekeeper_internal.h
+++ b/include/linux/timekeeper_internal.h
@@ -34,12 +34,18 @@ struct timekeeper {
/* Clock shifted nano seconds */
u64 xtime_nsec;

+ /* Tick length used in calculation of NTP error. */
+ u64 ntp_tick;
/* Difference between accumulated time and NTP time in ntp
* shifted nano seconds. */
s64 ntp_error;
/* Shift conversion between clock shifted nano seconds and
* ntp shifted nano seconds. */
u32 ntp_error_shift;
+ /* Correction applied to mult to reduce the error. */
+ s32 mult_ntp_correction;
+ /* Flag used to avoid updating NTP twice with same second */
+ u32 skip_second_overflow;

/*
* wall_to_monotonic is what we need to add to xtime (or xtime corrected
diff --git a/kernel/time/timekeeping.c b/kernel/time/timekeeping.c
index 0aa4ce8..b94f1df 100644
--- a/kernel/time/timekeeping.c
+++ b/kernel/time/timekeeping.c
@@ -147,6 +147,10 @@ static void tk_setup_internals(struct timekeeper *tk, struct clocksource *clock)
* to counteract clock drifting.
*/
tk->mult = clock->mult;
+
+ tk->ntp_tick = ntpinterval << tk->ntp_error_shift;
+ tk->mult_ntp_correction = 0;
+ tk->skip_second_overflow = 0;
}

/* Timekeeper helper functions. */
@@ -1052,106 +1056,49 @@ static int __init timekeeping_init_ops(void)
device_initcall(timekeeping_init_ops);

/*
- * If the error is already larger, we look ahead even further
- * to compensate for late or lost adjustments.
- */
-static __always_inline int timekeeping_bigadjust(struct timekeeper *tk,
- s64 error, s64 *interval,
- s64 *offset)
-{
- s64 tick_error, i;
- u32 look_ahead, adj;
- s32 error2, mult;
-
- /*
- * Use the current error value to determine how much to look ahead.
- * The larger the error the slower we adjust for it to avoid problems
- * with losing too many ticks, otherwise we would overadjust and
- * produce an even larger error. The smaller the adjustment the
- * faster we try to adjust for it, as lost ticks can do less harm
- * here. This is tuned so that an error of about 1 msec is adjusted
- * within about 1 sec (or 2^20 nsec in 2^SHIFT_HZ ticks).
- */
- error2 = tk->ntp_error >> (NTP_SCALE_SHIFT + 22 - 2 * SHIFT_HZ);
- error2 = abs(error2);
- for (look_ahead = 0; error2 > 0; look_ahead++)
- error2 >>= 2;
-
- /*
- * Now calculate the error in (1 << look_ahead) ticks, but first
- * remove the single look ahead already included in the error.
- */
- tick_error = ntp_tick_length() >> (tk->ntp_error_shift + 1);
- tick_error -= tk->xtime_interval >> 1;
- error = ((error - tick_error) >> look_ahead) + tick_error;
-
- /* Finally calculate the adjustment shift value. */
- i = *interval;
- mult = 1;
- if (error < 0) {
- error = -error;
- *interval = -*interval;
- *offset = -*offset;
- mult = -1;
- }
- for (adj = 0; error > i; adj++)
- error >>= 1;
-
- *interval <<= adj;
- *offset <<= adj;
- return mult << adj;
-}
-
-/*
* Adjust the multiplier to reduce the error value,
* this is optimized for the most common adjustments of -1,0,1,
* for other values we can do a bit more work.
*/
static void timekeeping_adjust(struct timekeeper *tk, s64 offset)
{
- s64 error, interval = tk->cycle_interval;
+ s64 interval = tk->cycle_interval;
+ u32 mult;
int adj;

/*
- * The point of this is to check if the error is greater than half
- * an interval.
- *
- * First we shift it down from NTP_SHIFT to clocksource->shifted nsecs.
- *
- * Note we subtract one in the shift, so that error is really error*2.
- * This "saves" dividing(shifting) interval twice, but keeps the
- * (error > interval) comparison as still measuring if error is
- * larger than half an interval.
- *
- * Note: It does not "save" on aggravation when reading the code.
+ * Get multiplier matching the current NTP tick length. Use division
+ * only when the tick length has changed (normally once per second
+ * with active NTP PLL).
*/
- error = tk->ntp_error >> (tk->ntp_error_shift - 1);
- if (error > interval) {
- /*
- * We now divide error by 4(via shift), which checks if
- * the error is greater than twice the interval.
- * If it is greater, we need a bigadjust, if its smaller,
- * we can adjust by 1.
- */
- error >>= 2;
- if (likely(error <= interval))
- adj = 1;
- else
- adj = timekeeping_bigadjust(tk, error, &interval, &offset);
+ if (tk->ntp_tick == ntp_tick_length()) {
+ mult = tk->mult - tk->mult_ntp_correction;
} else {
- if (error < -interval) {
- /* See comment above, this is just switched for the negative */
- error >>= 2;
- if (likely(error >= -interval)) {
- adj = -1;
- interval = -interval;
- offset = -offset;
- } else {
- adj = timekeeping_bigadjust(tk, error, &interval, &offset);
- }
- } else {
- goto out_adjust;
- }
+ tk->ntp_tick = ntp_tick_length();
+ mult = div64_u64((tk->ntp_tick >> tk->ntp_error_shift) -
+ tk->xtime_remainder, tk->cycle_interval);
+ }
+
+ /*
+ * Speed the clock up by 1 if it's behind NTP time. If there is no
+ * remainder from the tick division, the clock will stay ahead
+ * of NTP time until a non-divisible tick length is encountered.
+ * (slowing the clock by 1 would increase the frequency error)
+ */
+ tk->mult_ntp_correction = tk->ntp_error > 0 ? 1 : 0;
+ mult += tk->mult_ntp_correction;
+ adj = mult - tk->mult;
+
+ if (adj == 0)
+ goto out_adjust;
+
+ /* Multiply offset and interval by adj */
+ if (adj == -1) {
+ interval = -interval;
+ offset = -offset;
+ } else if (adj != 1) {
+ interval *= adj;
+ offset *= adj;
}

if (unlikely(tk->clock->maxadj &&
@@ -1207,13 +1154,10 @@ static void timekeeping_adjust(struct timekeeper *tk, s64 offset)
* xtime_nsec_2 = xtime_nsec_1 - offset
* Which simplfies to:
* xtime_nsec -= offset
- *
- * XXX - TODO: Doc ntp_error calculation.
*/
tk->mult += adj;
tk->xtime_interval += interval;
tk->xtime_nsec -= offset;
- tk->ntp_error -= (interval - offset) << tk->ntp_error_shift;

out_adjust:
/*
@@ -1222,18 +1166,13 @@ out_adjust:
* in the code above, its possible the required corrective factor to
* xtime_nsec could cause it to underflow.
*
- * Now, since we already accumulated the second, cannot simply roll
- * the accumulated second back, since the NTP subsystem has been
- * notified via second_overflow. So instead we push xtime_nsec forward
- * by the amount we underflowed, and add that amount into the error.
- *
- * We'll correct this error next time through this function, when
- * xtime_nsec is not as small.
+ * Now, since we already accumulated the second and the NTP subsystem has
+ * been notified via second_overflow(), we need to skip the next update.
*/
if (unlikely((s64)tk->xtime_nsec < 0)) {
- s64 neg = -(s64)tk->xtime_nsec;
- tk->xtime_nsec = 0;
- tk->ntp_error += neg << tk->ntp_error_shift;
+ tk->xtime_nsec += (u64)NSEC_PER_SEC << tk->shift;
+ tk->xtime_sec--;
+ tk->skip_second_overflow = 1;
}

}
@@ -1257,6 +1196,15 @@ static inline unsigned int accumulate_nsecs_to_secs(struct timekeeper *tk)
tk->xtime_nsec -= nsecps;
tk->xtime_sec++;

+ /*
+ * Skip NTP update if this second was accumulated before,
+ * i.e. xtime_nsec underflowed in timekeeping_adjust()
+ */
+ if (unlikely(tk->skip_second_overflow)) {
+ tk->skip_second_overflow = 0;
+ continue;
+ }
+
/* Figure out if its a leap sec and apply if needed */
leap = second_overflow(tk->xtime_sec);
if (unlikely(leap)) {
@@ -1315,7 +1263,7 @@ static cycle_t logarithmic_accumulation(struct timekeeper *tk, cycle_t offset,
tk->raw_time.tv_nsec = raw_nsecs;

/* Accumulate error between NTP and clock interval */
- tk->ntp_error += ntp_tick_length() << shift;
+ tk->ntp_error += tk->ntp_tick << shift;
tk->ntp_error -= (tk->xtime_interval + tk->xtime_remainder) <<
(tk->ntp_error_shift + shift);

@@ -1401,7 +1349,7 @@ void update_wall_time(void)
shift--;
}

- /* correct the clock when NTP error is too big */
+ /* Update the multiplier */
timekeeping_adjust(tk, offset);

/*
--
1.8.4.2


--h31gzZEtNLTqOjlF--
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/