Re: [PATCH clocksource 5/6] clocksource: Suspend the watchdog temporarily when high read latency detected

From: Paul E. McKenney
Date: Wed Jan 11 2023 - 12:51:10 EST


On Wed, Jan 11, 2023 at 12:26:58PM +0100, Thomas Gleixner wrote:
> On Wed, Jan 04 2023 at 17:07, Paul E. McKenney wrote:
> > This can be reproduced by running memory intensive 'stream' tests,
> > or some of the stress-ng subcases such as 'ioport'.
> >
> > The reason for these issues is that when the system is under heavy load, the
> > read latency of the clocksources can be very high. Even lightweight TSC
> > reads can show high latencies, and latencies are much worse for external
> > clocksources such as HPET or the ACPI PM timer. These latencies can
> > result in false-positive clocksource-unstable determinations.
> >
> > Given that the clocksource watchdog is a continual diagnostic check with
> > frequency of twice a second, there is no need to rush it when the system
> > is under heavy load. Therefore, when high clocksource read latencies
> > are detected, suspend the watchdog timer for 5 minutes.
>
> We should have enough heuristics in place by now to qualify the output of
> the clocksource watchdog as a random number generator, right?

Glad to see that you are still keeping up your style, Thomas! ;-)

We really do see the occasional clocksource skew in our fleet, and
sometimes it really is the TSC that is in disagreement with atomic-clock
time. And the watchdog does detect these, for example, the 40,000
parts-per-million case discussed recently. We therefore need a way to
check this, but without producing false positives on busy systems and
without the current kneejerk reflex of disabling TSC, thus rendering the
system useless from a performance standpoint for some important workloads.

Yes, if a system were 100% busy forever, this patch would suppress these
checks. But 100% busy forever is not the common case, due to thermal
throttling and to security updates if nothing else.
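
To make the intended behavior concrete, here is a rough userspace sketch
of the back-off described in the commit log. It is not the patch itself:
wd_check(), struct wd_state, and the read-delay threshold are all
hypothetical names and values chosen for illustration. Only the
twice-a-second period and the 5-minute suspension come from the commit log.

```c
#include <stdint.h>

#define WD_INTERVAL_NS        500000000ULL                /* twice a second */
#define WD_SUSPEND_NS         (5ULL * 60 * 1000000000ULL) /* 5 minutes */
#define WD_MAX_READ_DELAY_NS  100000ULL  /* hypothetical latency threshold */

enum wd_verdict { WD_OK, WD_SKEW, WD_SUSPENDED };

struct wd_state {
	uint64_t resume_at_ns;	/* checks are suppressed before this time */
};

/*
 * One watchdog pass.  read_delay_ns is how long the clocksource read
 * took, skew_ns is the measured disagreement with the reference clock.
 * A slow read makes the skew measurement untrustworthy, so instead of
 * marking the clocksource unstable, suspend checking for a while.
 */
static enum wd_verdict wd_check(struct wd_state *wd, uint64_t now_ns,
				uint64_t read_delay_ns, int64_t skew_ns,
				uint64_t max_skew_ns)
{
	uint64_t abs_skew = skew_ns < 0 ? -(uint64_t)skew_ns : (uint64_t)skew_ns;

	if (now_ns < wd->resume_at_ns)
		return WD_SUSPENDED;	/* still backing off */

	if (read_delay_ns > WD_MAX_READ_DELAY_NS) {
		wd->resume_at_ns = now_ns + WD_SUSPEND_NS;
		return WD_SUSPENDED;	/* busy system, do not judge skew */
	}

	return abs_skew > max_skew_ns ? WD_SKEW : WD_OK;
}
```

In this sketch, a genuinely skewed clocksource is still reported, just
not until the first pass after the suspension expires on which the read
latency is back under the threshold.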

With all that said, is there a better way to get the desired effects of
this patch?

Thanx, Paul