Re: frequent lockups in 3.18rc4

From: Linus Torvalds
Date: Mon Dec 22 2014 - 14:47:45 EST


On Sun, Dec 21, 2014 at 4:41 PM, Linus Torvalds
<torvalds@xxxxxxxxxxxxxxxxxxxx> wrote:
>
> This is *not* to say that this is the bug you're hitting. But it does show that
>
> (a) a flaky HPET can do some seriously bad stuff
> (b) the kernel is very fragile wrt time going backwards.
>
> and maybe we can use this test program to at least try to alleviate problem (b).

Ok, so after several false starts (ktime_get() is really really
fragile - called in scheduler things, and doing magic things at
bootup), here is something that seems to alleviate the problem for me.

I still get a lot of RCU messages like "self-detected stall" etc, but
that's to be expected. When the clock does odd things, crap *will*
happen.

But what this does is:

(a) make the error more visible as a clock error, rather than as
random failures in various downstream users

IOW, it prints things out when it looks like we're getting odd
clock read errors (arbitrary cut-off: we expect clock read-outs to be
within 1/8th of the range of the expected clock value)

(b) try to alleviate the horrible things that happen when the clock
error is big

The patch tries to "correct" for the huge time jump by basically
undoing it. We'll still see time jumps (there really is no way to
avoid it), but we limit the range of them.

With the attached patch, my machine seems to survive me writing to the
HPET master counter register. It spews warnings, and it is noisy about
the odd clock reads:

...
Clocksource hpet had cycles off by 642817751
Cutting it too close for hpet in update_wall_time (offset = 4034102337)
INFO: rcu_sched self-detected stall on CPU { 0} (t=281743 jiffies
g=4722 c=4721 q=14)
...

and there may still be situations where it does horrible horrible
things due to the (smaller) time leaps, but it does seem a lot more
robust.

NOTE! There's an (intentional) difference in how we handle the time
leaps at time read time vs write (wall-clock update).

At time read time, we just refuse to believe the big delta, and we set
the "cycle_error" value so that future time reads will be relative to
the error we just got. We also don't print anything out, because we're
possibly deep in the scheduler or in tracing, and we do not want to
spam the console about our fixup.

At time *write* time, we first report about any read-time errors, and
then we report (but believe in) overlarge clocksource delta errors as
we update the time.

This seems to be the best way to limit the damage.

Also note that the patch is entirely clock-agnostic. It's just that I
can trivially screw up my HPET, so I didn't look at other clocks.

One note: my current limit of clocksource delta errors is based on the
range of the clock (1/8th of the range). I actually think that's
bogus, and it should instead be based on the expected frequency of the
clock (ie "we are guaranteed to update the wall clock at least once
every second, so if the clock source delta read is larger than one
second, we've done something wrong"). So this patch is meant very much
as an RFC, rather than anything else. It's pretty hacky. But it does
actually make a huge difference for me wrt the "mess up HPET time on
purpose". That used to crash my machine pretty hard, and pretty
reliably. With this patch, I've done it ten+ times, and while it spews
a lot of garbage, the machine stays up and _works_.

Making the sanity check tighter (ie the "one second" band rather than
"1/8th of the clock range") would probably just improve it further.

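The frequency-based cut-off I have in mind would be something along these
lines: an untested sketch, with a made-up cycles_per_second() helper, using
only the mult/shift relationship we already carry around (ie "nsec =
(delta * mult) >> shift") and div_u64() from <linux/math64.h>:

/*
 * Sketch only: "one second's worth of cycles" for this clocksource,
 * derived from the existing mult/shift pair:
 *
 *	nsec = (delta * mult) >> shift
 *   =>	delta = (nsec << shift) / mult
 */
static inline cycle_t cycles_per_second(struct tk_read_base *tkr)
{
	return (cycle_t) div_u64((u64) NSEC_PER_SEC << tkr->shift, tkr->mult);
}

so that the check in timekeeping_get_ns() would compare the delta against
cycles_per_second(tkr) instead of "tkr->mask >> 3".
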
Thomas, what do you think? Hate it? Any better ideas?

And again: this is not trying to make the kernel clock not jump. There
is no way I can come up with, even in theory, to try to really *fix* a
fundamentally broken clock.

So this is not meant to be a real "fix" for anything, but is meant to
make sure that if the clock is unreliable, we pinpoint the clock
itself, and it mitigates the absolutely horrendously bad behavior we
currently have with bad clocks. So think of this as debug and band-aid
rather than "this makes clocks magically reliable".

.. and we might still lock up under some circumstances. But at least
from my limited testing, it is infinitely better, even if it
might not be perfect. Also note that my "testing" has been writing
zero to the HPET counter (so the HPET clock difference tends to be pretty
specific), while my next step is to see what happens when I write
random values (and a lot of them).

Since I expect that to cause more problems, I thought I'd send this
RFC out before I start killing my machine again ;)

Linus
include/linux/timekeeper_internal.h |  1 +
kernel/time/timekeeping.c           | 25 ++++++++++++++++++++++++-
2 files changed, 25 insertions(+), 1 deletion(-)

diff --git a/include/linux/timekeeper_internal.h b/include/linux/timekeeper_internal.h
index 05af9a334893..0fcb60d77079 100644
--- a/include/linux/timekeeper_internal.h
+++ b/include/linux/timekeeper_internal.h
@@ -32,6 +32,7 @@ struct tk_read_base {
cycle_t (*read)(struct clocksource *cs);
cycle_t mask;
cycle_t cycle_last;
+ cycle_t cycle_error;
u32 mult;
u32 shift;
u64 xtime_nsec;
diff --git a/kernel/time/timekeeping.c b/kernel/time/timekeeping.c
index 6a931852082f..1c842ddd567f 100644
--- a/kernel/time/timekeeping.c
+++ b/kernel/time/timekeeping.c
@@ -140,6 +140,7 @@ static void tk_setup_internals(struct timekeeper *tk, struct clocksource *clock)
tk->tkr.read = clock->read;
tk->tkr.mask = clock->mask;
tk->tkr.cycle_last = tk->tkr.read(clock);
+ tk->tkr.cycle_error = 0;

/* Do the ns -> cycle conversion first, using original mult */
tmp = NTP_INTERVAL_LENGTH;
@@ -197,11 +198,17 @@ static inline s64 timekeeping_get_ns(struct tk_read_base *tkr)
s64 nsec;

/* read clocksource: */
- cycle_now = tkr->read(tkr->clock);
+ cycle_now = tkr->read(tkr->clock) + tkr->cycle_error;

/* calculate the delta since the last update_wall_time: */
delta = clocksource_delta(cycle_now, tkr->cycle_last, tkr->mask);

+ /* Hmm? This is really not good, we're too close to overflowing */
+ if (unlikely(delta > (tkr->mask >> 3))) {
+ tkr->cycle_error = delta;
+ delta = 0;
+ }
+
nsec = delta * tkr->mult + tkr->xtime_nsec;
nsec >>= tkr->shift;

@@ -465,6 +472,16 @@ static void timekeeping_update(struct timekeeper *tk, unsigned int action)
update_fast_timekeeper(tk);
}

+static void check_cycle_error(struct tk_read_base *tkr)
+{
+ cycle_t error = tkr->cycle_error;
+
+ if (unlikely(error)) {
+ tkr->cycle_error = 0;
+ pr_err("Clocksource %s had cycles off by %llu\n", tkr->clock->name, error);
+ }
+}
+
/**
* timekeeping_forward_now - update clock to the current time
*
@@ -481,6 +498,7 @@ static void timekeeping_forward_now(struct timekeeper *tk)
cycle_now = tk->tkr.read(clock);
delta = clocksource_delta(cycle_now, tk->tkr.cycle_last, tk->tkr.mask);
tk->tkr.cycle_last = cycle_now;
+ check_cycle_error(&tk->tkr);

tk->tkr.xtime_nsec += delta * tk->tkr.mult;

@@ -1237,6 +1255,7 @@ static void timekeeping_resume(void)

/* Re-base the last cycle value */
tk->tkr.cycle_last = cycle_now;
+ tk->tkr.cycle_error = 0;
tk->ntp_error = 0;
timekeeping_suspended = 0;
timekeeping_update(tk, TK_MIRROR | TK_CLOCK_WAS_SET);
@@ -1591,11 +1610,15 @@ void update_wall_time(void)
if (unlikely(timekeeping_suspended))
goto out;

+ check_cycle_error(&real_tk->tkr);
+
#ifdef CONFIG_ARCH_USES_GETTIMEOFFSET
offset = real_tk->cycle_interval;
#else
offset = clocksource_delta(tk->tkr.read(tk->tkr.clock),
tk->tkr.cycle_last, tk->tkr.mask);
+ if (unlikely(offset > (tk->tkr.mask >> 3)))
+ pr_err("Cutting it too close for %s in in update_wall_time (offset = %llu)\n", tk->tkr.clock->name, offset);
#endif

/* Check if there's really nothing to do */