Re: [PATCH v5 3/3] Add BUG_XX() debugging hard/soft lockup detection

From: Don Zickus
Date: Wed Feb 03 2016 - 15:15:09 EST


On Wed, Feb 03, 2016 at 10:23:42AM -0700, Jeffrey Merkey wrote:
> > Hmm, I am confused here. So you are saying because we are in the nmi
> > handler you can not break into the system? The nmi handler prints some
> > stuff to the screen, pokes the other cpus to print stuff to the screen and
> > then returns to a normal operation. Unless you are saying the act of
> > sending NMI IPIs never completes (because a cpu is blocking IPI
> > interrupts),
> > so the cpu hangs in nmi context and the debugger never has a chance to
> > 'break' in and see what is going on?
> >
> > Cheers,
> > Don
> >
>
> Yes. the nmi handlers never complete for the bug I worked on with
> tglx, probably because an nmi handler is calling timekeeper.c
> somewhere. Some of these lockup bugs may be calling code from the nmi
> handlers that cause the lockup condition in the first place in some
> cases, so it will never reach a call to panic. Looking over this code
> it's damn hard to find a good way to do this that works across all the
> arches without adding another macro to bug.h (BREAK_ON maybe), so I
> just used one that's already there. I'll go back and rethink this
> some more. It could just be as simple as calling panic from the first
> detection -- that works.

So, if you disable 'sysctl_hardlockup_all_cpu_backtrace' and enable
'hardlockup_panic', you should be able to achieve what you want, no?

But you mentioned you wanted to recover? Hence avoiding the panic?

Cheers,
Don