Re: High rate of touch_softlockup makes Soft Lockup detector useless

From: Joel Fernandes
Date: Thu Jul 07 2016 - 19:47:36 EST


On Thu, Jul 7, 2016 at 4:06 PM, Joel Fernandes <agnel.joel@xxxxxxxxx> wrote:
>>> Digging in further, I found that the softlockup watchdog is touched
>>> 1000s of times per second by the NOHZ code.
>>> Debug prints revealed the following 2 functions calling touch_softlockup_watchdog:
>>> [ 165.960292] CPU0 touch: tick_nohz_restart_sched_tick
>>> [ 165.960309] CPU1 touch: tick_nohz_update_jiffies
>>>
>>> I am wondering, do we really need to touch the softlockup watchdog
>>> from the tick_nohz code?
>>> From the code comments it looks like the watchdog is touched because
>>> the tick was off and is being turned back on, so the watchdog may
>>> not have been touched for a long time.
>>> BUT, wouldn't the hrtimer interrupt for the watchdog timer cause the
>>> watchdog thread to be scheduled even though the tick was off for a
>>> long time? Then in that case do we really need to touch the softlockup
>>> watchdog from the tick_nohz code?
>>
>> Yes, it will be scheduled, but it might be too late. Assume the following:
>>
>> t1 hrtimer fires
>> watchdog thread runs
>> watchdog timer is rearmed to t2 = t1 + period
>>
>> idle sleep
>>
>> t2 - 1ms long running thread gets scheduled
>>
>> t2 hrtimer fires
>>
>> long running thread stops
>>
>> watchdog thread runs and detects soft lockup
>>
>> The soft lockup detector checks whether the CPU is hogged by some random
>> task. It does so by monitoring whether the watchdog task, which is periodically
>> scheduled by an hrtimer, gets to run before the watchdog period elapses.
>>
>> If the cpu goes idle then nothing hogs the cpu and the check period can be
>> canceled.
>
> That makes sense, thanks for explaining. I found out my problem was
> because of occasional serial console prints resetting the watchdogs.
> In drivers/tty/serial/8250/8250_port.c touch_nmi_watchdog() is being
> called. Disabling serial console makes the softlockup and hardlockup
> detectors work again for me.

Just clarifying: the tick_nohz functions calling touch_softlockup_watchdog
were running in the context of the idle thread, so they were not the
problem, although they misled me into thinking they were. They were
happening on CPUs other than the one that was locked up. The serial
console messages were what was actually causing the watchdogs to get
reset on the locked-up CPU for me.
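
To illustrate the effect, here is a minimal user-space sketch of the
timestamp check. It is not the kernel's implementation: the names
wd_model, wd_touch, wd_is_softlockup and simulate_hog are hypothetical,
and time is modelled in whole seconds. It only shows that anything
touching the watchdog more often than the threshold keeps a hogged CPU
from ever being reported.

```c
#include <stdbool.h>

/* Hypothetical, simplified model of a soft lockup check (not kernel code). */
struct wd_model {
	unsigned long touch_ts;	/* last time the watchdog was touched, in seconds */
	unsigned long thresh;	/* soft lockup threshold, in seconds */
};

/* Record "now" as the last time the watchdog was touched. */
static void wd_touch(struct wd_model *wd, unsigned long now)
{
	wd->touch_ts = now;
}

/* A lockup is reported once the last touch is older than the threshold. */
static bool wd_is_softlockup(const struct wd_model *wd, unsigned long now)
{
	return now - wd->touch_ts > wd->thresh;
}

/*
 * Simulate a CPU that is hogged from t = 0 for `duration` seconds, so the
 * watchdog task itself never runs. Some other code path (assumed here to
 * stand in for console output) touches the watchdog every `touch_every`
 * seconds; 0 means it is never touched.
 *
 * Returns the time at which the lockup is first reported, or -1 if it is
 * never reported.
 */
static long simulate_hog(unsigned long duration, unsigned long thresh,
			 unsigned long touch_every)
{
	struct wd_model wd = { .touch_ts = 0, .thresh = thresh };
	unsigned long now;

	for (now = 1; now <= duration; now++) {
		if (touch_every && now % touch_every == 0)
			wd_touch(&wd, now);
		/* The periodic timer check runs once per simulated second. */
		if (wd_is_softlockup(&wd, now))
			return (long)now;
	}
	return -1;
}
```

With a 20 second threshold, simulate_hog(60, 20, 0) reports the lockup
at t = 21, while simulate_hog(60, 20, 5) returns -1: the periodic touch
hides a 60 second hog completely, which matches what I saw with the
serial console enabled.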

Thanks,
Joel