Re: [PATCH 2/3] watchdog/softlockup: Report the overall time of softlockups

From: Thomas Gleixner
Date: Thu Jan 16 2020 - 08:39:42 EST


Petr,

Petr Mladek <pmladek@xxxxxxxx> writes:
> - if (touch_ts == 0) {
> + /* Was the watchdog touched externally by a known slow code? */
> + if (period_ts == 0) {
> if (unlikely(__this_cpu_read(softlockup_touch_sync))) {
> /*
> * If the time stamp was touched atomically
> @@ -394,7 +405,12 @@ static enum hrtimer_restart watchdog_timer_fn(struct hrtimer *hrtimer)
>
> /* Clear the guest paused flag on watchdog reset */
> kvm_check_and_clear_guest_paused();
> - __touch_watchdog();
> + /*
> + * The above kvm*() function could touch the watchdog.
> + * Set the real timestamp later to avoid an infinite
> loop.

This comment makes no sense whatsoever. If period_ts is 0,
i.e. something invoked touch_softlockup_watchdog*() then it does not
make any difference whether the kvm function invokes one of those
functions once more. The result is the same. The per cpu period_ts is
still 0.

The point is that _AFTER_ a intentional watchdog reset, the reporting
base time needs to be set to now() in order to make it functional again.

> + */
> + reset_report_period_ts();

Btw, the function name is misleading. I got confused several times
because I expected the reset to set the timestamp to 0, which is not the
case. update_report_period_ts() is far more intuitive.

> @@ -404,8 +420,9 @@ static enum hrtimer_restart watchdog_timer_fn(struct hrtimer *hrtimer)
> * indicate it is getting cpu time. If it hasn't then
> * this is a good indication some task is hogging the cpu
> */
> - duration = is_softlockup(touch_ts);
> + duration = is_softlockup(touch_ts, period_ts);
> if (unlikely(duration)) {

This lacks a comment. Your changelog paragraph:

> Also the timestamp should get reset explicitly. It is done also by the code
> printing the backtrace. But it is just a side effect and far from
> obvious.

is probably referring to this, but it confused me more than it helped
simply because the update of the timestamp happens unconditionally even
when the backtrace code is not reached due to the KVM check

So this is a change vs. the current implementation which is not
documented and explained.

> + reset_report_period_ts();
> /*
> * If a virtual machine is stopped by the host it can look to
> * the watchdog like a soft lockup, check to see if the host

Thanks,

tglx