Re: Seeking Linux watchdog design advice to troubleshoot mystery silent reboot issue

From: Linus Walleij
Date: Wed Dec 14 2011 - 17:11:47 EST


On Mon, Dec 5, 2011 at 8:55 PM, Vincent Li <vincent.mc.li@xxxxxxxxx> wrote:

> we have a complex system with a large number of processes running
> simultaneously. If any of the processes gets into a faulty state and
> hangs or consumes more than its fair share of the system resources,
> the other processes may not get a chance to run, and the whole system
> can hang, interrupting the system functionality and user traffic.

Have you tried using RLIMITs?

Last time I used something like this from each process:

#include <sys/time.h>
#include <sys/resource.h>

struct rlimit rl;
int ret;

// Allow each process at most 5 seconds of CPU time (RLIMIT_CPU is in seconds)
rl.rlim_cur = rl.rlim_max = 5;
ret = setrlimit(RLIMIT_CPU, &rl);
// Allow a realtime process at most 1 second of CPU time without
// blocking (RLIMIT_RTTIME is in microseconds)
rl.rlim_cur = rl.rlim_max = 1000000;
ret = setrlimit(RLIMIT_RTTIME, &rl);

The latter is good if you have real-time processes.
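
For reference, here is a slightly more defensive sketch of the same idea. The helper names lower_limit() and apply_watchdog_limits() are made up for illustration and are not from any library. It assumes you only ever want to lower limits, because an unprivileged process cannot raise its hard limit. Once the soft RLIMIT_CPU limit is hit the kernel sends SIGXCPU, whose default action terminates the process.

```c
#include <sys/time.h>
#include <sys/resource.h>

/*
 * Hypothetical helper: lower both the soft and the hard limit of
 * `resource` to `value`. It never tries to raise an existing hard
 * limit. Returns 0 on success, -1 on failure with errno set.
 */
int lower_limit(int resource, rlim_t value)
{
    struct rlimit rl;

    if (getrlimit(resource, &rl) != 0)
        return -1;
    /* Keep the existing hard limit if it is already stricter. */
    if (rl.rlim_max == RLIM_INFINITY || value < rl.rlim_max)
        rl.rlim_max = value;
    rl.rlim_cur = rl.rlim_max;
    return setrlimit(resource, &rl);
}

/* Apply the same two limits as the fragment above; 0 only if both succeed. */
int apply_watchdog_limits(void)
{
    if (lower_limit(RLIMIT_CPU, 5) != 0)           /* seconds of CPU time */
        return -1;
    if (lower_limit(RLIMIT_RTTIME, 1000000) != 0)  /* microseconds, Linux-specific */
        return -1;
    return 0;
}
```

Call apply_watchdog_limits() early in each process, before it starts its real work, and treat a non-zero return as a startup error.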

There are also RLIMITs for memory consumption.

Consult:
http://kernel.org/doc/man-pages/online/pages/man2/getrlimit.2.html
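
As an illustration of the memory limits, the sketch below uses RLIMIT_AS (total address space) in a forked child and checks that an oversized malloc() is refused. The function names and the sizes are assumptions picked for the example, so tune them for your own processes. The fork is only there so that the limit does not stick to the calling process.

```c
#include <stdlib.h>
#include <sys/types.h>
#include <sys/time.h>
#include <sys/resource.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: cap the address space of the calling process. */
int limit_address_space(rlim_t bytes)
{
    struct rlimit rl;

    if (getrlimit(RLIMIT_AS, &rl) != 0)
        return -1;
    /* Only lower the hard limit, never raise it. */
    if (rl.rlim_max == RLIM_INFINITY || bytes < rl.rlim_max)
        rl.rlim_max = bytes;
    rl.rlim_cur = rl.rlim_max;
    return setrlimit(RLIMIT_AS, &rl);
}

/*
 * Fork a child, apply the limit there and try one allocation.
 * Returns 1 if the allocation was refused, 0 if it succeeded,
 * -1 if the limit could not be set or the child died abnormally.
 */
int alloc_refused_under_limit(rlim_t limit_bytes, size_t alloc_bytes)
{
    int status, code;
    pid_t pid = fork();

    if (pid < 0)
        return -1;
    if (pid == 0) {
        void *volatile p;   /* volatile keeps the malloc() call from being optimized away */

        if (limit_address_space(limit_bytes) != 0)
            _exit(2);
        p = malloc(alloc_bytes);
        _exit(p == NULL ? 1 : 0);
    }
    if (waitpid(pid, &status, 0) < 0 || !WIFEXITED(status))
        return -1;
    code = WEXITSTATUS(status);
    return code == 2 ? -1 : code;
}
```

For example, alloc_refused_under_limit((rlim_t)256 << 20, (size_t)1 << 30) limits the child to 256 MiB of address space and then asks for 1 GiB, which the kernel should refuse.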

> CPU and memory control group features are not considered at this stage
> because it is too invasive to change in our custom kernel.

Do you mean that you are using an antique kernel with many custom
patches and you don't want to upgrade because it's a lot of work?
Mainlining your code and keeping each patch topic on a special
git branch (and using git) are recommended practices.

If you mean you have been stripping it down for footprint then
it's another thing which I can fully understand...

Yours,
Linus Walleij