Re: [PATCH] rcu: allow multiple stalls before panic

From: Chao Zhou
Date: Fri Sep 04 2020 - 15:40:45 EST


Thanks Paul. Appreciated it.

Initial intent was to give users a way to make their system more
tolerable, but prudent enough to recover if suspicious behavior
reaches a watermark. If a system experiences multiple stalls in one
lifetime, no matter how healthy it looks or whether the stalls are
from different sources, we still want it to dramatically recover.
Please share your guidance?

eero inc.

660 3rd St, 4th Floor

San Francisco, CA 94107



On Fri, Sep 4, 2020 at 11:05 AM Paul E. McKenney <paulmck@xxxxxxxxxx> wrote:
>
> On Sun, Aug 30, 2020 at 11:41:17PM -0700, chao wrote:
> > Some stalls are transient and system can fully recover.
> > Allow users to configure the number of stalls experienced
> > to trigger kernel Panic.
> >
> > Signed-off-by: chao <chao@xxxxxxxx>
>
> Hearing no objections, I have queued this with wordsmithing as shown
> below. Please let me know if I messed something up.
>
> One question, though. It looks like setting this to (say) 5 would panic
> after the fifth RCU CPU stall warning message, regardless whether all
> five were reporting the same RCU CPU stall event or whether they instead
> were five widely separated transient RCU CPU stall events, where the
> system fully recovered from each event. Is this the intent?
>
> Thanx, Paul
>
> ------------------------------------------------------------------------
>
> commit e710c928fb52d8e56bc6173515805301da6aa22b
> Author: chao <chao@xxxxxxxx>
> Date: Sun Aug 30 23:41:17 2020 -0700
>
> rcu: Panic after fixed number of stalls
>
> Some stalls are transient, so that system fully recovers. This commit
> therefore allows users to configure the number of stalls that must happen
> in order to trigger kernel panic.
>
> Signed-off-by: chao <chao@xxxxxxxx>
> Signed-off-by: Paul E. McKenney <paulmck@xxxxxxxxxx>
>
> diff --git a/include/linux/kernel.h b/include/linux/kernel.h
> index 500def6..fc2dd3f 100644
> --- a/include/linux/kernel.h
> +++ b/include/linux/kernel.h
> @@ -536,6 +536,7 @@ extern int panic_on_warn;
> extern unsigned long panic_on_taint;
> extern bool panic_on_taint_nousertaint;
> extern int sysctl_panic_on_rcu_stall;
> +extern int sysctl_max_rcu_stall_to_panic;
> extern int sysctl_panic_on_stackoverflow;
>
> extern bool crash_kexec_post_notifiers;
> diff --git a/kernel/rcu/tree_stall.h b/kernel/rcu/tree_stall.h
> index 0fde39b..228c55f 100644
> --- a/kernel/rcu/tree_stall.h
> +++ b/kernel/rcu/tree_stall.h
> @@ -13,6 +13,7 @@
>
> /* panic() on RCU Stall sysctl. */
> int sysctl_panic_on_rcu_stall __read_mostly;
> +int sysctl_max_rcu_stall_to_panic __read_mostly;
>
> #ifdef CONFIG_PROVE_RCU
> #define RCU_STALL_DELAY_DELTA (5 * HZ)
> @@ -106,6 +107,11 @@ early_initcall(check_cpu_stall_init);
> /* If so specified via sysctl, panic, yielding cleaner stall-warning output. */
> static void panic_on_rcu_stall(void)
> {
> + static int cpu_stall;
> +
> + if (++cpu_stall < sysctl_max_rcu_stall_to_panic)
> + return;
> +
> if (sysctl_panic_on_rcu_stall)
> panic("RCU Stall\n");
> }
> diff --git a/kernel/sysctl.c b/kernel/sysctl.c
> index 287862f..1bca490 100644
> --- a/kernel/sysctl.c
> +++ b/kernel/sysctl.c
> @@ -2651,6 +2651,17 @@ static struct ctl_table kern_table[] = {
> .extra2 = SYSCTL_ONE,
> },
> #endif
> +#if defined(CONFIG_TREE_RCU)
> + {
> + .procname = "max_rcu_stall_to_panic",
> + .data = &sysctl_max_rcu_stall_to_panic,
> + .maxlen = sizeof(sysctl_max_rcu_stall_to_panic),
> + .mode = 0644,
> + .proc_handler = proc_dointvec_minmax,
> + .extra1 = SYSCTL_ONE,
> + .extra2 = SYSCTL_INT_MAX,
> + },
> +#endif
> #ifdef CONFIG_STACKLEAK_RUNTIME_DISABLE
> {
> .procname = "stack_erasing",