Re: [RFC PATCH] hung_task: show sysctl_hung_task_warnings

From: Weiping Zhang
Date: Sun Jan 29 2023 - 04:11:20 EST


Sorry for the late reply.

On Fri, Jan 13, 2023 at 9:43 PM Petr Mladek <pmladek@xxxxxxxx> wrote:
>
> On Thu 2023-01-12 17:17:45, Weiping Zhang wrote:
> > This patch try to add more debug info to detect lost kernel log or no
> > hung task was detected.
> >
> > The user set 10 to the hung_task_timeout_secs, the kernel log:
> >
> > [ 3942.642220] INFO: task mount:19066 blocked for more than 10 seconds.
> > [ 3952.876768] INFO: task kworker/u81:0:7 blocked for more than 10 seconds.
> > [ 3952.877088] INFO: task scsi_eh_0:506 blocked for more than 10 seconds.
> > [ 3952.878212] INFO: task mount:19066 blocked for more than 10 seconds.
> > [ 3963.116805] INFO: task kworker/u81:0:7 blocked for more than 10 seconds.
> > [ 3963.117137] INFO: task scsi_eh_0:506 blocked for more than 10 seconds.
> > [ 3963.118275] INFO: task mount:19066 blocked for more than 10 seconds.
> > [ 3973.356837] INFO: task kworker/u81:0:7 blocked for more than 10 seconds.
> > [ 3973.357148] INFO: task scsi_eh_0:506 blocked for more than 10 seconds.
> > [ 3973.358247] INFO: task mount:19066 blocked for more than 10 seconds.

Expect but no hung task log at this moment.

> > [ 3993.836899] INFO: task kworker/u81:0:7 blocked for more than 10 seconds.
> > [ 3993.837238] INFO: task scsi_eh_0:506 blocked for more than 10 seconds.
> > [ 3993.838356] INFO: task mount:19066 blocked for more than 10 seconds.
> >
> > There is no any log at about 3983, it's hard to know if kernel log was
> > lost or there is no hung task was detected at that moment. So this patch
> > print sysctl_hung_task_warnings to distinguish the above two cases.
> >
> > Signed-off-by: Weiping Zhang <zhangweiping@xxxxxxxxxxxxxx>
> > ---
> > kernel/hung_task.c | 5 ++++-
> > 1 file changed, 4 insertions(+), 1 deletion(-)
> >
> > diff --git a/kernel/hung_task.c b/kernel/hung_task.c
> > index c71889f3f3fc..ca917931473d 100644
> > --- a/kernel/hung_task.c
> > +++ b/kernel/hung_task.c
> > @@ -127,8 +127,11 @@ static void check_hung_task(struct task_struct *t, unsigned long timeout)
> > * complain:
> > */
> > if (sysctl_hung_task_warnings) {
> > - if (sysctl_hung_task_warnings > 0)
> > + if (sysctl_hung_task_warnings > 0) {
> > sysctl_hung_task_warnings--;
> > + pr_err("sysctl_hung_task_warnings: %d\n",
> > + sysctl_hung_task_warnings);
> > + }
>
> It is too much noise. But it might make sense to report it when
> the counter gets down to zero. Something like:
>
> if (sysctl_hung_task_warnings)
> pr_info("Future hung task reports are suppressed, see sysctl kernel.hung_task_warnings\n");
>
> and move this down after printing all the details for this hung task report.
I'm fine to print this warning when it gets down to 0.

The warning counter is useful to detect kernel log lost or not, if add
a new line to print this count is too noise,
how about append this counter at the end of the following line:
pr_err("INFO: task %s:%d blocked for more than %ld seconds,
sysctl_hung_task_warnings: %d\n",
t->comm, t->pid, (jiffies - t->last_switch_time) / HZ,
sysctl_hung_task_warnings);
>
> > pr_err("INFO: task %s:%d blocked for more than %ld seconds.\n",
> > t->comm, t->pid, (jiffies - t->last_switch_time) / HZ);
> > pr_err(" %s %s %.*s\n",
>
> Best Regards,
> Petr

Thanks a ton
Weiping