Re: [PATCH v5 3/3] Add BUG_XX() debugging hard/soft lockup detection

From: Jeffrey Merkey
Date: Tue Feb 02 2016 - 17:40:41 EST


>
> Please remember to add version history, so I can tell what changed.
>

What option do I give git format-patch so that the patch it creates
includes what you are looking for, or do I have to add that manually?
Do you mean the diff of files changed?

>
> I am not sure I am a fan of this. You are taking a known macro BUG_ON with
> known expectations and perversely converting it into an 'asm'. So now when
> folks read the code they scratch their heads why we are dumping the stack
> twice when in fact we are not. It seems misleading. :-/
>

1. The patch does not dump the stack at all the way it is coded. Look
again. The current code dumps it only once. The patch just executes an
int3 and returns instead of crashing. If you called panic all the time
instead of conditionally in this code, this change would not be needed,
since panic is already set up to call debuggers. It's the failure of
the current code to do that which requires this change. How about you
call panic when this condition occurs? Then the debugger will get called.

2. BUG() emits an asm("ud2"), which triggers an invalid instruction
trap and a system crash. All that was added is the ability to switch
that ud2 to an int3. So what is more perverse here:

BUG() = ud2 -> invalid instruction -> trap -> call crash code ->
debugger -> then hang
BUG() = int3 -> int3 trap -> enter debugger -> return - system can recover

Because:

BUG() = Debugger = int3
and
BUG() != ud2 (undefined instruction) = crash = non recoverable

int3 (0xCC) has always been understood to mean BUG(). int3
breakpoints are an integral part of Intel's architecture. There is no
reason not to exploit this capability of their processors to help
kernel developers use Intel technology better.
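
To make the difference concrete, here is a rough userspace sketch, not
the patch itself. It assumes Linux, where an int3 is delivered to the
process as SIGTRAP with the instruction pointer already past the
breakpoint, so a handler that simply returns lets execution continue.
The names DEMO_BUG_INT3, on_sigtrap and demo_recoverable_bug are made
up for illustration, and on non-x86 builds the macro falls back to
raise(SIGTRAP).

```c
#include <signal.h>

/* Counts how many breakpoint traps the handler has seen. */
static volatile sig_atomic_t traps_seen;

/* Stand-in for "enter debugger": note the trap and return. */
static void on_sigtrap(int sig)
{
    (void)sig;
    traps_seen++;
}

/* Hypothetical macro: an int3 on x86, a plain SIGTRAP elsewhere. */
#if defined(__x86_64__) || defined(__i386__)
#define DEMO_BUG_INT3() __asm__ volatile("int3")
#else
#define DEMO_BUG_INT3() raise(SIGTRAP)
#endif

/*
 * Hits the breakpoint once and returns the number of traps handled
 * by this call, or -1 if the handler could not be installed.
 * Reaching the return statement at all shows the trap was recoverable.
 */
static int demo_recoverable_bug(void)
{
    sig_atomic_t before = traps_seen;

    if (signal(SIGTRAP, on_sigtrap) == SIG_ERR)
        return -1;

    DEMO_BUG_INT3();

    return (int)(traps_seen - before);
}
```

Calling demo_recoverable_bug() from main() should return normally
after the trap. Swapping the int3 for a ud2 would raise SIGILL with the
instruction pointer still on the ud2, so a handler that just returns
would fault again on the same instruction.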

> I still don't understand why we can't use Ingo's or tglx's approach? Your
> changelog doesn't point out the problems there.
>

Because when you catch a bug in the hard lockup detector, the system
just sits there hard hung and you cannot get into a debugger console.
The system has crashed, and the watchdog code has already killed off
the other processors and locked up all the NMI interrupt handlers,
which prevents any debugger from functioning other than a hardware
ICE. It is a hell of a lot easier just to trigger a break when you
detect the first instance of a hard lockup, before the system is
completely hosed.

So, let's try to use Ingo's and tglx's approach. For some reason
neither Ingo nor tglx seems to understand that I am referring to a
normal user of a system who gets a hard hang that may or may not be
reproducible. You cannot even get into a debugger console when the
lockup occurs unless you have a breakpoint already set. If the current
code always called panic, the problem would be solved: panic triggers
debugger entry if it detects one.
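
A toy model of that argument, in plain userspace C. This is not the
watchdog code; every name here (demo_debugger, demo_panic, demo_warn,
demo_hard_lockup) is hypothetical and only mirrors the control flow
described above: the debugger is reached through the panic path, and
the panic path is taken only conditionally.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical stand-in for a registered debugger entry point. */
typedef void (*demo_debugger_fn)(const char *reason);
static demo_debugger_fn demo_debugger;

/* Number of times the model debugger has been entered. */
static int demo_debugger_entries;

static void demo_debugger_enter(const char *reason)
{
    demo_debugger_entries++;
    fprintf(stderr, "debugger entered: %s\n", reason);
}

/* Models panic: hands control to a debugger if one is registered. */
static void demo_panic(const char *reason)
{
    if (demo_debugger != NULL)
        demo_debugger(reason);
}

/* Models the non-panic path: report the problem and carry on. */
static void demo_warn(const char *reason)
{
    fprintf(stderr, "WARNING: %s\n", reason);
}

/*
 * Models the lockup callback. Returns the total number of debugger
 * entries so far, so the caller can see which path reached it.
 */
static int demo_hard_lockup(bool panic_on_lockup)
{
    if (panic_on_lockup)
        demo_panic("hard LOCKUP");
    else
        demo_warn("hard LOCKUP");
    return demo_debugger_entries;
}
```

With demo_debugger set to demo_debugger_enter, demo_hard_lockup(false)
only prints the warning and never reaches the debugger, while
demo_hard_lockup(true) does.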

Jeff

> Cheers,
> Don
>