Re: [PATCH] [RESEND] rlimits: Print more information when limits are exceeded

From: Arun Raghavan
Date: Fri Mar 30 2012 - 13:19:07 EST


On Fri, 2012-03-30 at 16:29 +0200, Thomas Gleixner wrote:
> On Fri, 30 Mar 2012, David Henningsson wrote:
> > On 03/30/2012 03:39 PM, Thomas Gleixner wrote:
> > > On Fri, 24 Feb 2012, Arun Raghavan wrote:
> > >
> > > > This dumps some information in logs when a process exceeds its CPU or RT
> > > > limits (soft and hard). Makes debugging easier when userspace triggers
> > > > these limits.
> > >
> > > Why do we need to spam the logs with such information?
> > >
> > > SIGXCPU is only ever sent by this code. If there is a signal handler
> > > in the application it's easy to debug. If not it's even easier, the
> > > thing will simply be killed and you get the reason printed.
> >
> > I'm not totally sure, but don't we log SIGSEGVs? If so, the same reasoning
> > would apply to SIGSEGV.
>
> I think so. Dunno why this was added in the first place. core dumps or
> proper signal handlers are telling you usually more than that single
> line in dmesg.
>
> > > For the SIGKILL case there are only a limited number of reasons why a
> > > SIGKILL is sent. So no, I'd rather commit a patch which removes that
> > > ugly printk which is already there instead of adding more of them.
> >
> > The reason I proposed some kind of printk for SIGKILL, was to get some
> > diagnostic information out of the SIGKILL. E.g., if you have two threads both
> > running on rtprio rlimits in the same process, it would be very interesting to
> > know which one of them was causing the kernel to send SIGKILL.
>
> Usually the one which ignored SIGXCPU for quite a while. There is a
> reason why SIGXCPU can be handled by the application.

In general I agree -- I'm happy to rewrite the patch to drop the printk
in the SIGXCPU case.

In the current situation that I'm debugging, there appears to be a
kernel fragment that's busy waiting and eventually gets killed (I'll be
taking up a fix for this separately). In this case, by the time we get
back control, the hard limit seems to be already hit. Knowing the
culprit thread in this case does make things simpler for us.

> > Also, it could be useful to know whether the SIGKILL was actually sent by the
> > kernel, or by some other process feeling evil (e.g. "kill -9").
>
> Agreed, but instead of adding that printk to the rlimit code I prefer
> a generic infrastructure which can be used by all call sites which
> issue SIGKILL. Something like: [__]kill_it(flags, task, "Reason");

The other paths that send SIGKILL seem to be slightly different
(eventually going via do_send_sig_info()). Is this actually functionally the
same? If yes, I'll try to rewrite the patch to consolidate some of these
paths as you suggest.

-- Arun
