Re: [RFC PATCH] watchdog: Adding softwatchdog

From: peter enderborg
Date: Sun Apr 25 2021 - 02:42:12 EST


On 4/25/21 3:08 AM, Tetsuo Handa wrote:
> On 2021/04/25 1:19, peter enderborg wrote:
>>> I don't think this proposal is a watchdog. I think this proposal is
>>> a timer based process killer, based on an assumption that any slowdown
>>> which prevents the monitor process from pinging for more than 0.5 seconds
>>> (if HZ == 1000) is caused by memory pressure.
>> You missing the point. The oom killer is a example of a work that it can do.
>> it is one policy. The idea is that you should have a policy that fits your needs.
> Implementing policy which can run in kernel from timer interrupt context is
> quite limited, for it is not allowed to perform operations that might sleep. See
>
> [RFC] memory reserve for userspace oom-killer
> https://urldefense.com/v3/__https://lkml.kernel.org/r/CALvZod7vtDxJZtNhn81V=oE-EPOf=4KZB2Bv6Giz*u3bFFyOLg@mail.gmail.com__;Kw!!JmoZiZGBv3RvKRSx!tqBFKAdfydRJ5M0oP4xCRvSscrBwChj5MWuj1YUNAk05uORWkbcz-iodFCHYjKdOytmHoO4$
>
> for implementing possibly useful policy.

I you need to do a more complex approach you might need to
have a work queue.  For example a SIGTERM solution might
be like that. You send sigterm wait some time and then send a sigkill.


>> oom_score_adj is suitable for a android world. But it might be based on
>> uid's if your priority is some users over other. Or a memcg. Or as
>> Christophe Leroy want the current. The policy is only a example that
>> fits a one area.
> Horrible idea. Imagine a kernel module that randomly sends SIGTERM/SIGKILL
> to "current" thread. How normal systems can survive? A normal system is not
> designed to survive random signals.

I think you need to see it in the context of a watchdog. It might be
problematic, but it has a good statistical change to hit a cpu hogger. 

And seeing as watchdog, the alternative is a system reset. You
take a chance.  Reboot should be the last resort.

I can imagine a kernel module that  randomly sends SIGTERM/SIGKILL,
we already have that. It is called oom-kill. This is *exactly* the problem.

>
>> You need to describe your prioritization, in android it is
>> oom_score_adj. For example I would very much have a policy that sends
>> sigterm instead of sigkill.
> That's because Android framework is designed to survive random signals
> (in order to survive memory pressure situation).
It using a lot to control the system. It use it differently than you would
with a shell or window-manager.
>
>> But the integration with oom is there because
>> it is needed. Maybe a bad choice for political reasons but I don't it a
>> good idea to hide the intention. Please don't focus on the oom part.
> I wonder what system other than Android framework can utilize this module.
I think it will be useful for embedded systems as well.
> By the way, there already is "Software Watchdog" ( drivers/watchdog/softdog.c )
> which some people might call it "soft watchdog". It is very confusing to name
> your module as "softwatchdog". Please find a different name.
>
It is mention in the patch-set. I had as an idea to add this function to that one,
but I decided that it was better to separate so point out the feature  that is to
be "Soft" rather than so hard.