Re: [PATCH clocksource] Reject bogus watchdog clocksource measurements

From: Feng Tang
Date: Tue Nov 01 2022 - 22:58:59 EST


On Tue, Nov 01, 2022 at 12:06:27PM -0700, Paul E. McKenney wrote:
> On Tue, Nov 01, 2022 at 01:43:32PM +0800, Feng Tang wrote:
> > On Mon, Oct 31, 2022 at 10:42:12AM -0700, Paul E. McKenney wrote:
> >
> > [...]
> > > > > @@ -448,8 +448,26 @@ static void clocksource_watchdog(struct timer_list *unused)
> > > > > continue;
> > > > > }
> > > > > if (wd_nsec > (wdi << 2)) {
> > > >
> > > > Just recalled one thing, that it may be better to check 'cs_nsec'
> > > > instead of 'wd_nsec', as some watchdog may have small wrap-around
> > > > value. IIRC, HPET's counter is 32 bits long and wraps at about
> > > > 300 seconds, and PMTIMER's counter is 24 bits which wraps at about
> > > > 3 ~ 4 seconds. So when a long stall of the watchdog timer happens,
> > > > the watchdog's value could 'overflow' many times.
> > > >
> > > > And usually the 'current' closcksource has longer wrap time than
> > > > the watchdog.
> > >
> > > Why not both?
> >
> > You mean checking both clocksource and the watchdog? It's fine for
> > me, though I still trust clocksource more.
>
> OK, good, I will check both. You never know what future hardware
> might bring.

Makes sense to me.

> I also reversed the order of the checks, so that it now checks for heavy
> load before too-short interval. The purpose is to automatically avoid
> being fooled by clock wrap.
>
> > I checked some old emails and found some long stall logs for reference.
> >
> > * one stall of 471 seconds
> >
> > [ 2410.694068] clocksource: timekeeping watchdog on CPU262: Marking clocksource 'tsc' as unstable because the skew is too large:
> > [ 2410.706920] clocksource: 'hpet' wd_nsec: 0 wd_now: ffd70be2 wd_last: 40da633b mask: ffffffff
> > [ 2410.718583] clocksource: 'tsc' cs_nsec: 471766594285 cs_now: 44f62c184e9 cs_last: 394a7a43771 mask: ffffffffffffffff
> > [ 2410.732568] clocksource: 'tsc' is current clocksource.
> > [ 2410.740553] tsc: Marking TSC unstable due to clocksource watchdog
> > [ 2410.747611] TSC found unstable after boot, most likely due to broken BIOS. Use 'tsc=unstable'.
> > [ 2410.757321] sched_clock: Marking unstable (2398804490960, 11943006672)<-(2419023952548, -8276474713)
> > [ 2410.767741] clocksource: Checking clocksource tsc synchronization from CPU 233 to CPUs 0,73,93-94,226,454,602,821.
> > [ 2410.784045] clocksource: Switched to clocksource hpet
> >
> >
> > * another one of 5 seconds
> >
> > [ 3302.211708] clocksource: timekeeping watchdog on CPU9: Marking clocksource 'tsc' as unstable because the skew is too large:
> > [ 3302.211710] clocksource: 'acpi_pm' wd_nsec: 312227950 wd_now: 92367f wd_last: 8128bd mask: ffffff
> > [ 3302.211712] clocksource: 'tsc' cs_nsec: 4999196389 cs_now: 9e811223a9754 cs_last: 9e80e767df194 mask: ffffffffffffffff
> > [ 3302.211714] clocksource: 'tsc' is current clocksource.
> > [ 3302.211716] tsc: Marking TSC unstable due to clocksource watchdog
>
> Very good, thank you! I believe that both of these would be handled
> by the updated commit (see below for the update).

Yes, I think so too.

>
> > > if (wd_nsec > (wdi << 2) || cs_nsec > (wdi << 2)) {
> > >
> > > > > - /* This can happen on busy systems, which can delay the watchdog. */
> > > > > - pr_warn("timekeeping watchdog on CPU%d: Watchdog clocksource '%s' advanced an excessive %lld ns during %d-jiffy time interval, probable CPU overutilization, skipping watchdog check.\n", smp_processor_id(), watchdog->name, wd_nsec, WATCHDOG_INTERVAL);
> > > > > + bool needwarn = false;
> > > > > + u64 wd_lb;
> > > > > +
> > > > > + cs->wd_bogus_count++;
> > > > > + if (!cs->wd_bogus_shift) {
> > > > > + needwarn = true;
> > > > > + } else {
> > > > > + delta = clocksource_delta(wdnow, cs->wd_last_bogus, watchdog->mask);
> > > > > + wd_lb = clocksource_cyc2ns(delta, watchdog->mult, watchdog->shift);
> > > > > + if ((1 << cs->wd_bogus_shift) * wdi <= wd_lb)
> > > > > + needwarn = true;
> > > >
> > > > I'm not sure if we need to check the last_bogus counter, or just
> > > > the current interval 'cs_nsec' is what we care, and some code
> > > > like this ?
> > >
> > > I thought we wanted exponential backoff? Do you really get that from
> > > the changes below?
> >
> > Aha, I misunderstood your words. I thought to only report one time for
> > each 2, 4, 8, ... 256 seconds stall, and after that only report stall
> > of 512+ seconds. So your approach looks good to me, as our intention is
> > to avoid the flood of warning message.
>
> Sounds good, thank you!
>
> Please see below for a patch to be squashed into the original.
>
> Thoughts?

It looks good to me, thanks!

- Feng

>
>
> Thanx, Paul
>
> ------------------------------------------------------------------------
>
> commit eaee921daa7091f0eb731c9217ccc638ed5f8baf
> Author: Paul E. McKenney <paulmck@xxxxxxxxxx>
> Date: Tue Nov 1 12:02:18 2022 -0700
>
> squash! clocksource: Exponential backoff for load-induced bogus watchdog reads
>
> [ paulmck: Apply Feng Tang feedback. ]
>
> Signed-off-by: Paul E. McKenney <paulmck@xxxxxxxxxx>
>
> diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c
> index 6537ffa02e445..de8047b6720f5 100644
> --- a/kernel/time/clocksource.c
> +++ b/kernel/time/clocksource.c
> @@ -442,12 +442,7 @@ static void clocksource_watchdog(struct timer_list *unused)
>
> /* Check for bogus measurements. */
> wdi = jiffies_to_nsecs(WATCHDOG_INTERVAL);
> - if (wd_nsec < (wdi >> 2)) {
> - /* This usually indicates broken timer code or hardware. */
> - pr_warn("timekeeping watchdog on CPU%d: Watchdog clocksource '%s' advanced only %lld ns during %d-jiffy time interval, skipping watchdog check.\n", smp_processor_id(), watchdog->name, wd_nsec, WATCHDOG_INTERVAL);
> - continue;
> - }
> - if (wd_nsec > (wdi << 2)) {
> + if (wd_nsec > (wdi << 2) || cs_nsec > (wdi << 2)) {
> bool needwarn = false;
> u64 wd_lb;
>
> @@ -470,6 +465,12 @@ static void clocksource_watchdog(struct timer_list *unused)
> }
> continue;
> }
> + /* Check too-short measurements second to handle wrap. */
> + if (wd_nsec < (wdi >> 2) || cs_nsec < (wdi >> 2)) {
> + /* This usually indicates broken timer code or hardware. */
> + pr_warn("timekeeping watchdog on CPU%d: Watchdog clocksource '%s' advanced only %lld ns during %d-jiffy time interval, skipping watchdog check.\n", smp_processor_id(), watchdog->name, wd_nsec, WATCHDOG_INTERVAL);
> + continue;
> + }
>
> /* Check the deviation from the watchdog clocksource. */
> md = cs->uncertainty_margin + watchdog->uncertainty_margin;