Re: [PATCH] rcu: allow multiple stalls before panic

From: Paul E. McKenney
Date: Fri Sep 04 2020 - 16:37:21 EST


On Fri, Sep 04, 2020 at 12:40:29PM -0700, Chao Zhou wrote:
> Thanks Paul. Appreciated it.
>
> Initial intent was to give users a way to make their system more
> tolerable, but prudent enough to recover if suspicious behavior
> reaches a watermark. If a system experiences multiple stalls in one
> lifetime, no matter how healthy it looks or whether the stalls are
> from different sources, we still want it to dramatically recover.
> Please share your guidance?

I have no guidance in this case. I was just wanting to verify that the
patch was in fact doing what you want it to. And it sounds like it does,
so good!

I have queued this for testing and further review. If all goes well,
I would submit it upstream for the v5.11 merge window, that is, not the
upcoming merge window, but the one after that.

Thanx, Paul

> eero inc.
>
> 660 3rd St, 4th Floor
>
> San Francisco, CA 94107
>
>
>
> On Fri, Sep 4, 2020 at 11:05 AM Paul E. McKenney <paulmck@xxxxxxxxxx> wrote:
> >
> > On Sun, Aug 30, 2020 at 11:41:17PM -0700, chao wrote:
> > > Some stalls are transient and system can fully recover.
> > > Allow users to configure the number of stalls experienced
> > > to trigger kernel Panic.
> > >
> > > Signed-off-by: chao <chao@xxxxxxxx>
> >
> > Hearing no objections, I have queued this with wordsmithing as shown
> > below. Please let me know if I messed something up.
> >
> > One question, though. It looks like setting this to (say) 5 would panic
> > after the fifth RCU CPU stall warning message, regardless whether all
> > five were reporting the same RCU CPU stall event or whether they instead
> > were five widely separated transient RCU CPU stall events, where the
> > system fully recovered from each event. Is this the intent?
> >
> > Thanx, Paul
> >
> > ------------------------------------------------------------------------
> >
> > commit e710c928fb52d8e56bc6173515805301da6aa22b
> > Author: chao <chao@xxxxxxxx>
> > Date: Sun Aug 30 23:41:17 2020 -0700
> >
> > rcu: Panic after fixed number of stalls
> >
> > Some stalls are transient, so that system fully recovers. This commit
> > therefore allows users to configure the number of stalls that must happen
> > in order to trigger kernel panic.
> >
> > Signed-off-by: chao <chao@xxxxxxxx>
> > Signed-off-by: Paul E. McKenney <paulmck@xxxxxxxxxx>
> >
> > diff --git a/include/linux/kernel.h b/include/linux/kernel.h
> > index 500def6..fc2dd3f 100644
> > --- a/include/linux/kernel.h
> > +++ b/include/linux/kernel.h
> > @@ -536,6 +536,7 @@ extern int panic_on_warn;
> > extern unsigned long panic_on_taint;
> > extern bool panic_on_taint_nousertaint;
> > extern int sysctl_panic_on_rcu_stall;
> > +extern int sysctl_max_rcu_stall_to_panic;
> > extern int sysctl_panic_on_stackoverflow;
> >
> > extern bool crash_kexec_post_notifiers;
> > diff --git a/kernel/rcu/tree_stall.h b/kernel/rcu/tree_stall.h
> > index 0fde39b..228c55f 100644
> > --- a/kernel/rcu/tree_stall.h
> > +++ b/kernel/rcu/tree_stall.h
> > @@ -13,6 +13,7 @@
> >
> > /* panic() on RCU Stall sysctl. */
> > int sysctl_panic_on_rcu_stall __read_mostly;
> > +int sysctl_max_rcu_stall_to_panic __read_mostly;
> >
> > #ifdef CONFIG_PROVE_RCU
> > #define RCU_STALL_DELAY_DELTA (5 * HZ)
> > @@ -106,6 +107,11 @@ early_initcall(check_cpu_stall_init);
> > /* If so specified via sysctl, panic, yielding cleaner stall-warning output. */
> > static void panic_on_rcu_stall(void)
> > {
> > + static int cpu_stall;
> > +
> > + if (++cpu_stall < sysctl_max_rcu_stall_to_panic)
> > + return;
> > +
> > if (sysctl_panic_on_rcu_stall)
> > panic("RCU Stall\n");
> > }
> > diff --git a/kernel/sysctl.c b/kernel/sysctl.c
> > index 287862f..1bca490 100644
> > --- a/kernel/sysctl.c
> > +++ b/kernel/sysctl.c
> > @@ -2651,6 +2651,17 @@ static struct ctl_table kern_table[] = {
> > .extra2 = SYSCTL_ONE,
> > },
> > #endif
> > +#if defined(CONFIG_TREE_RCU)
> > + {
> > + .procname = "max_rcu_stall_to_panic",
> > + .data = &sysctl_max_rcu_stall_to_panic,
> > + .maxlen = sizeof(sysctl_max_rcu_stall_to_panic),
> > + .mode = 0644,
> > + .proc_handler = proc_dointvec_minmax,
> > + .extra1 = SYSCTL_ONE,
> > + .extra2 = SYSCTL_INT_MAX,
> > + },
> > +#endif
> > #ifdef CONFIG_STACKLEAK_RUNTIME_DISABLE
> > {
> > .procname = "stack_erasing",