Re: [RFC PATCH] hung_task: show sysctl_hung_task_warnings

From: Weiping Zhang
Date: Tue Jan 31 2023 - 02:56:41 EST


On Mon, Jan 30, 2023 at 6:28 PM Petr Mladek <pmladek@xxxxxxxx> wrote:
>
> On Sun 2023-01-29 17:08:29, Weiping Zhang wrote:
> > Sorry for the late reply.
> >
> > On Fri, Jan 13, 2023 at 9:43 PM Petr Mladek <pmladek@xxxxxxxx> wrote:
> > >
> > > On Thu 2023-01-12 17:17:45, Weiping Zhang wrote:
> > > > This patch try to add more debug info to detect lost kernel log or no
> > > > hung task was detected.
> > > >
> > > > The user set 10 to the hung_task_timeout_secs, the kernel log:
> > > >
> > > > [ 3942.642220] INFO: task mount:19066 blocked for more than 10 seconds.
> > > > [ 3952.876768] INFO: task kworker/u81:0:7 blocked for more than 10 seconds.
> > > > [ 3952.877088] INFO: task scsi_eh_0:506 blocked for more than 10 seconds.
> > > > [ 3952.878212] INFO: task mount:19066 blocked for more than 10 seconds.
> > > > [ 3963.116805] INFO: task kworker/u81:0:7 blocked for more than 10 seconds.
> > > > [ 3963.117137] INFO: task scsi_eh_0:506 blocked for more than 10 seconds.
> > > > [ 3963.118275] INFO: task mount:19066 blocked for more than 10 seconds.
> > > > [ 3973.356837] INFO: task kworker/u81:0:7 blocked for more than 10 seconds.
> > > > [ 3973.357148] INFO: task scsi_eh_0:506 blocked for more than 10 seconds.
> > > > [ 3973.358247] INFO: task mount:19066 blocked for more than 10 seconds.
> >
> > Expect but no hung task log at this moment.
> >
> > > > [ 3993.836899] INFO: task kworker/u81:0:7 blocked for more than 10 seconds.
> > > > [ 3993.837238] INFO: task scsi_eh_0:506 blocked for more than 10 seconds.
> > > > [ 3993.838356] INFO: task mount:19066 blocked for more than 10 seconds.
> > > >
> > > > There is no any log at about 3983, it's hard to know if kernel log was
> > > > lost or there is no hung task was detected at that moment. So this patch
> > > > print sysctl_hung_task_warnings to distinguish the above two cases.
> > > >
> > > > Signed-off-by: Weiping Zhang <zhangweiping@xxxxxxxxxxxxxx>
> > > > ---
> > > > kernel/hung_task.c | 5 ++++-
> > > > 1 file changed, 4 insertions(+), 1 deletion(-)
> > > >
> > > > diff --git a/kernel/hung_task.c b/kernel/hung_task.c
> > > > index c71889f3f3fc..ca917931473d 100644
> > > > --- a/kernel/hung_task.c
> > > > +++ b/kernel/hung_task.c
> > > > @@ -127,8 +127,11 @@ static void check_hung_task(struct task_struct *t, unsigned long timeout)
> > > > * complain:
> > > > */
> > > > if (sysctl_hung_task_warnings) {
> > > > - if (sysctl_hung_task_warnings > 0)
> > > > + if (sysctl_hung_task_warnings > 0) {
> > > > sysctl_hung_task_warnings--;
> > > > + pr_err("sysctl_hung_task_warnings: %d\n",
> > > > + sysctl_hung_task_warnings);
> > > > + }
> > >
> > > It is too much noise. But it might make sense to report it when
> > > the counter gets down to zero. Something like:
> > >
> > > if (sysctl_hung_task_warnings)
> > > pr_info("Future hung task reports are suppressed, see sysctl kernel.hung_task_warnings\n");
> > >
> > > and move this down after printing all the details for this hung task report.
> > I'm fine to print this warning when it gets down to 0.
>
> I prefer this variant.
It's ok, Yuanli will submit a patch for this.
>
> > The warning counter is useful to detect kernel log lost or not, if add
> > a new line to print this count is too noise,
> > how about append this counter at the end of the following line:
> > pr_err("INFO: task %s:%d blocked for more than %ld seconds,
> > sysctl_hung_task_warnings: %d\n",
> > t->comm, t->pid, (jiffies - t->last_switch_time) / HZ,
> > sysctl_hung_task_warnings);
>
> Is it really important to print this on every line, please? IMHO, most people
> do not care. IMHO. it would add more harm (confusion, noise) than good.
>
Thanks
> Best Regards,
> Petr