Re: [PATCH RFC smp] Remove diagnostics and adjust config for CSD lock diagnostics

From: Paul E. McKenney
Date: Tue Mar 21 2023 - 11:39:08 EST


On Tue, Mar 21, 2023 at 11:22:20AM +0100, Peter Zijlstra wrote:
> On Mon, Mar 20, 2023 at 05:54:39PM -0700, Paul E. McKenney wrote:
> > Hello!
> >
> > This series removes CSD-lock diagnostics that were once very useful
> > but which have not seen much action since. It also adjusts Kconfig and
> > kernel-boot-parameter setup.
> >
> > 1. locking/csd_lock: Add Kconfig option for csd_debug default.
> >
> > 2. locking/csd_lock: Remove added data from CSD lock debugging.
> >
> > 3. locking/csd_lock: Remove per-CPU data indirection from CSD
> >    lock debugging.
> >
> > 4. kernel/smp: Make csdlock_debug= resettable.
> >
> > Thanx, Paul
> >
> > ------------------------------------------------------------------------
> >
> >  Documentation/admin-guide/kernel-parameters.txt   |   17 -
> >  b/Documentation/admin-guide/kernel-parameters.txt |    6
> >  b/kernel/smp.c                                    |    2
> >  b/lib/Kconfig.debug                               |    9
> >  kernel/smp.c                                      |  260 ++--------------------
> >  5 files changed, 47 insertions(+), 247 deletions(-)
>
> Yay!! How do you want to route these, should I take them through tip?

Either way works for me. If you take them into -tip, I will drop them
from -rcu. If you don't take them into -tip, I will send Linus a pull
request for the upcoming merge window. And if you take them at just
the wrong time, we will both send them to Linus. ;-)

Your choice!

> What about the rest of the thing? Your commits seem to suggest it's
> still actually used -- why? Are there still more virt bugs?

Thus far, no luck. I proposed ditching some of the stack traces, but
that got shot down.

These diagnostics find the following issues: (1) CPU looping with
interrupts disabled. (2) CPU stuck in a longer-than-average SMI handler
or other firmware sand trap. (3) CPU fail-stopped.

In theory, we could drop the RCU CPU stall warning to five seconds and
catch this same stuff. Unfortunately, in practice, there would need to
be lots of churn from CPUs looping with preemption disabled, which we
still get from time to time even at 21 seconds.

NMIs can be used to deal with #1, and the hard lockup detector in fact
sort of does this. But NMIs are not helpful for #2 and #3.

So nothing yet, but I am still looking for improved diagnostics.

Thanx, Paul