Re: [RFC PATCH] hung_task: show sysctl_hung_task_warnings

From: Petr Mladek
Date: Mon Jan 30 2023 - 05:28:43 EST


On Sun 2023-01-29 17:08:29, Weiping Zhang wrote:
> Sorry for the late reply.
>
> On Fri, Jan 13, 2023 at 9:43 PM Petr Mladek <pmladek@xxxxxxxx> wrote:
> >
> > On Thu 2023-01-12 17:17:45, Weiping Zhang wrote:
> > > This patch try to add more debug info to detect lost kernel log or no
> > > hung task was detected.
> > >
> > > The user set 10 to the hung_task_timeout_secs, the kernel log:
> > >
> > > [ 3942.642220] INFO: task mount:19066 blocked for more than 10 seconds.
> > > [ 3952.876768] INFO: task kworker/u81:0:7 blocked for more than 10 seconds.
> > > [ 3952.877088] INFO: task scsi_eh_0:506 blocked for more than 10 seconds.
> > > [ 3952.878212] INFO: task mount:19066 blocked for more than 10 seconds.
> > > [ 3963.116805] INFO: task kworker/u81:0:7 blocked for more than 10 seconds.
> > > [ 3963.117137] INFO: task scsi_eh_0:506 blocked for more than 10 seconds.
> > > [ 3963.118275] INFO: task mount:19066 blocked for more than 10 seconds.
> > > [ 3973.356837] INFO: task kworker/u81:0:7 blocked for more than 10 seconds.
> > > [ 3973.357148] INFO: task scsi_eh_0:506 blocked for more than 10 seconds.
> > > [ 3973.358247] INFO: task mount:19066 blocked for more than 10 seconds.
>
> Expect but no hung task log at this moment.
>
> > > [ 3993.836899] INFO: task kworker/u81:0:7 blocked for more than 10 seconds.
> > > [ 3993.837238] INFO: task scsi_eh_0:506 blocked for more than 10 seconds.
> > > [ 3993.838356] INFO: task mount:19066 blocked for more than 10 seconds.
> > >
> > > There is no any log at about 3983, it's hard to know if kernel log was
> > > lost or there is no hung task was detected at that moment. So this patch
> > > print sysctl_hung_task_warnings to distinguish the above two cases.
> > >
> > > Signed-off-by: Weiping Zhang <zhangweiping@xxxxxxxxxxxxxx>
> > > ---
> > > kernel/hung_task.c | 5 ++++-
> > > 1 file changed, 4 insertions(+), 1 deletion(-)
> > >
> > > diff --git a/kernel/hung_task.c b/kernel/hung_task.c
> > > index c71889f3f3fc..ca917931473d 100644
> > > --- a/kernel/hung_task.c
> > > +++ b/kernel/hung_task.c
> > > @@ -127,8 +127,11 @@ static void check_hung_task(struct task_struct *t, unsigned long timeout)
> > > * complain:
> > > */
> > > if (sysctl_hung_task_warnings) {
> > > - if (sysctl_hung_task_warnings > 0)
> > > + if (sysctl_hung_task_warnings > 0) {
> > > sysctl_hung_task_warnings--;
> > > + pr_err("sysctl_hung_task_warnings: %d\n",
> > > + sysctl_hung_task_warnings);
> > > + }
> >
> > It is too much noise. But it might make sense to report it when
> > the counter gets down to zero. Something like:
> >
> > if (sysctl_hung_task_warnings)
> > pr_info("Future hung task reports are suppressed, see sysctl kernel.hung_task_warnings\n");
> >
> > and move this down after printing all the details for this hung task report.
> I'm fine to print this warning when it gets down to 0.

I prefer this variant.

> The warning counter is useful to detect kernel log lost or not, if add
> a new line to print this count is too noise,
> how about append this counter at the end of the following line:
> pr_err("INFO: task %s:%d blocked for more than %ld seconds,
> sysctl_hung_task_warnings: %d\n",
> t->comm, t->pid, (jiffies - t->last_switch_time) / HZ,
> sysctl_hung_task_warnings);

Is it really important to print this on every line, please? IMHO, most people
do not care. IMHO. it would add more harm (confusion, noise) than good.

Best Regards,
Petr