[RESEND 2] [PATCH] rlimits: Print more information when limits are exceeded

From: Arun Raghavan
Date: Sat Feb 18 2017 - 03:38:07 EST


This dumps some information in logs when a process exceeds its CPU or RT
limits (soft and hard). Makes debugging easier when userspace triggers
these limits.

Signed-off-by: Arun Raghavan <arun@xxxxxxxxxxxxxxxx>
---
kernel/time/posix-cpu-timers.c | 11 ++++++++++-
1 file changed, 10 insertions(+), 1 deletion(-)

Hello,
This has come up a couple of times in the past, but we haven't been able to
resolve whatever issues were pointed out.

In the meantime, we have frustrated users who don't know where they're getting
a SIGKILL from, and I'd really like to have a way for people to not have to go
through this.

The issues that came up the last time were:

1. SIGXCPU messages shouldn't be needed since they can be caught: it's still
useful to have the log because it isn't always possible to pin down the
thread causing the problem in userspace.

2. SIGKILL logging should be centralised: there seem to be multiple paths that
trigger a SIGKILL -- and it seemed a bit ugly to try to add a reason
parameter on all of them for the KILL case. Any other suggestions on how to
deal with this?
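
To illustrate point 1, here is a minimal userspace sketch, assuming a
Linux/POSIX system. The helper name burn_until_sigxcpu() and the handler
on_sigxcpu() are made up for this example and are not part of the patch. It
lowers only the soft RLIMIT_CPU limit, spins until SIGXCPU arrives, and shows
that the handler learns nothing beyond the signal number, so it cannot tell
which thread used up the CPU time:

```c
#ifndef _XOPEN_SOURCE
#define _XOPEN_SOURCE 700
#endif

#include <assert.h>
#include <signal.h>
#include <string.h>
#include <sys/resource.h>

/* Set by the handler once the soft RLIMIT_CPU limit has fired. */
static volatile sig_atomic_t got_sigxcpu;

/* The handler only sees the signal number, nothing about the culprit. */
static void on_sigxcpu(int signo)
{
	(void)signo;
	got_sigxcpu = 1;
}

/*
 * Hypothetical helper: set the soft CPU limit to soft_secs seconds of
 * process CPU time (the hard limit is left untouched), then burn CPU
 * until SIGXCPU is caught.
 * Returns 0 once the signal was seen, -1 if the setup failed.
 */
int burn_until_sigxcpu(rlim_t soft_secs)
{
	struct sigaction sa;
	struct rlimit rl;

	assert(soft_secs > 0);

	memset(&sa, 0, sizeof(sa));
	sa.sa_handler = on_sigxcpu;
	sigemptyset(&sa.sa_mask);
	if (sigaction(SIGXCPU, &sa, NULL) != 0)
		return -1;

	/* Keep the current hard limit, change only the soft one. */
	if (getrlimit(RLIMIT_CPU, &rl) != 0)
		return -1;
	rl.rlim_cur = soft_secs;
	if (setrlimit(RLIMIT_CPU, &rl) != 0)
		return -1;

	got_sigxcpu = 0;
	while (!got_sigxcpu)
		;	/* busy loop: consume CPU time until the signal arrives */
	return 0;
}
```

Calling burn_until_sigxcpu(1) from main() should return after roughly one
second of CPU time. In a multi-threaded program the same handler would run
without any hint of which thread was responsible, which is where a kernel log
line helps.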

I'm happy to fix this up so that it actually gets merged this time, but if there
aren't any suggestions, just pushing this out as-is will make our lives a little
less painful.

Thanks,
Arun

diff --git a/kernel/time/posix-cpu-timers.c b/kernel/time/posix-cpu-timers.c
index e9e8c10..6dbcf84 100644
--- a/kernel/time/posix-cpu-timers.c
+++ b/kernel/time/posix-cpu-timers.c
@@ -860,6 +860,9 @@ static void check_thread_timers(struct task_struct *tsk,
* At the hard limit, we just die.
* No need to calculate anything else now.
*/
+ printk(KERN_INFO
+ "RT Watchdog Timeout (hard): %s[%d]\n",
+ tsk->comm, task_pid_nr(tsk));
__group_send_sig_info(SIGKILL, SEND_SIG_PRIV, tsk);
return;
}
@@ -872,7 +875,7 @@ static void check_thread_timers(struct task_struct *tsk,
sig->rlim[RLIMIT_RTTIME].rlim_cur = soft;
}
printk(KERN_INFO
- "RT Watchdog Timeout: %s[%d]\n",
+ "RT Watchdog Timeout (soft): %s[%d]\n",
tsk->comm, task_pid_nr(tsk));
__group_send_sig_info(SIGXCPU, SEND_SIG_PRIV, tsk);
}
@@ -980,6 +983,9 @@ static void check_process_timers(struct task_struct *tsk,
* At the hard limit, we just die.
* No need to calculate anything else now.
*/
+ printk(KERN_INFO
+ "CPU Watchdog Timeout (hard): %s[%d]\n",
+ tsk->comm, task_pid_nr(tsk));
__group_send_sig_info(SIGKILL, SEND_SIG_PRIV, tsk);
return;
}
@@ -987,6 +993,9 @@ static void check_process_timers(struct task_struct *tsk,
/*
* At the soft limit, send a SIGXCPU every second.
*/
+ printk(KERN_INFO
+ "CPU Watchdog Timeout (soft): %s[%d]\n",
+ tsk->comm, task_pid_nr(tsk));
__group_send_sig_info(SIGXCPU, SEND_SIG_PRIV, tsk);
if (soft < hard) {
soft++;
--
2.9.3
