Re: KDB in the mainstream 2.4.x kernels?

From: Robin Holt (holt@sgi.com)
Date: Tue Jul 29 2003 - 14:44:40 EST


On Fri, Jul 18, 2003 at 10:43:57PM +0200, Andi Kleen wrote:
>
> One argument I have against it: KDB is incredibly ugly code.
> Before it could even be considered for merging it would need quite a lot
> of cleanup.
> ...

I believe a consensus was reached that the ugliness was out of
necessity; I hope my understanding is correct.

>
> As for your home crash debugging I suspect you would be better off with LKCD
> or Mcore (crash dump, saving a crash image on some partition, and examining
> the crash dump after reboot)

I personally think _BOTH_ belong in the kernel: KDB for locating problems
in my own code during development, and a crash dump facility so I can look
at the problems I don't believe are part of my code at a later time.
Additionally, once I am no longer debugging the code, I would rather
turn KDB off and leave LKCD in place to capture the particular
problem I am chasing. How many problems encountered on user machines
have simply been met with a reboot-and-ignore response?
That is no way to fix problems!
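To make the point concrete, here is a minimal sketch of why one sanity
check serves both tools. The lock, list, and function names are all made
up for illustration: with KDB built into a development kernel, the panic()
drops you into the debugger with the machine state still in front of you;
with LKCD configured on a production kernel, the very same panic() triggers
a dump you can pick apart after the reboot.

#include <linux/kernel.h>
#include <linux/spinlock.h>
#include <linux/list.h>

/* Hypothetical names, 2.4-style initializers. */
static spinlock_t my_lock = SPIN_LOCK_UNLOCKED;
static LIST_HEAD(my_queue);

static void my_consume(struct list_head *item)
{
	spin_lock(&my_lock);

	/* One invariant check; whether the failure lands in KDB or in
	 * an LKCD dump depends on what is built into the kernel, not
	 * on this code. */
	if (list_empty(&my_queue))
		panic("my_consume: queue empty with item %p pending", item);

	list_del(item);
	spin_unlock(&my_lock);
}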

For me, and I believe I am not alone, the most infuriating problem is the
one-line change in my own code which "tickles" someone else's bug, leaving
me to spend three days trying to find what really went wrong.

I am sure that nearly anyone working in the bowels of the kernel has also
had a userland program that terminated with nothing more than a bus error.

>
> KDB is usually not useful for debugging hangs on desktop boxes (and even
> many servers) because you usually have X running. When the machine crashes and
> drops into KDB you cannot see the text output and debug anything. I learned
> to type "go<return>" blind when I still had a KDB-aware kernel, but
> it's not very useful overall.

I believe that this could be addressed once KDB gets into the kernel. If
this were clearly stated as a condition for getting KDB in, then I am
sure someone could figure out a method.

>
> On a development machine you can avoid that by connecting a serial cable,
> but that's usually not easily possible for a desktop box. Doing a post-mortem
> after reboot is more practical. That is what LKCD/mcore do.
>
> The disadvantage is that the current crash dump mechanisms (lkcd, mcore
> crash, netdump) are all still not very reliable and have horrible
> error handling. And you can eat a lot of disk space for the dumps and
> they tend to overflow your file systems. But still it's the only
> realistic option for "desktop bugs".

I am not sure what you mean when you say LKCD is not reliable. I use LKCD
to create crash dumps at work all the time. The only problems I have
are when the NMI doesn't propagate correctly into LKCD to initiate
the dump. I believe Keith Owens has improved LKCD's handling of being
called from KDB, and now that seems to always work as well.

I think the error handling being referred to is in the device drivers
themselves. That should probably be fixed on a case-by-case basis instead
of putting the blame on LKCD. I don't see a driving force to fix
drivers for a crash dump facility without a clear direction as to
which facility will be accepted.

As for space, I view that as an admin problem. If you are selling
a machine with a lot of memory, you need to size the dump device for the
dumps you will typically get. I recently initiated an LKCD dump on a
machine with 8GB of memory, configured to dump kernel pages including
buffers, and the uncompressed dump output was still only 161MB. I have
also taken a dump from a machine with over 128GB of memory in 64K pages
shortly after booting with init=/bin/sh. It took only 108MB. That doesn't
seem too large in the days of 120GB drives, and it can be reduced even
further by using compression (already supported by LKCD) and by not
dumping kernel buffers.
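To put rough numbers on the sizing question, here is a throwaway userspace
sketch; the ratio is simply the one observed on the 8GB machine above, so
treat the extrapolation as a back-of-envelope guide rather than anything
LKCD guarantees.

#include <stdio.h>

int main(void)
{
	double ram_mb = 8.0 * 1024.0;     /* the 8GB machine above         */
	double dump_mb = 161.0;           /* uncompressed dump it produced */
	double ratio = dump_mb / ram_mb;  /* roughly 2% of RAM             */

	printf("observed dump/RAM ratio: %.1f%%\n", ratio * 100.0);
	printf("a 128GB machine at the same ratio: ~%.0f MB\n",
	       128.0 * 1024.0 * ratio);
	return 0;
}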

>
> BTW debugging on the X server works on linuxppc/mac with xmon because it
> has an fbcon-based console and X server. The debugger can just
> work on the X background while the X server is stopped. Very nifty.
> But I don't see the x86 XFree86 switching to a similar fbcon model any
> time soon, so it's unlikely to help.
>
> -Andi

I believe that crash dumps and in-core debugging are useful for a lot of
others as well. If not, why have Red Hat's netdump and United Linux's
LKCD implementations been done? It seems the only people saying "no"
to dumps and in-core debugging are the people who claim to benefit from
them least. I view that as no less arrogant than the PPC people being
able to say no to adding machine vectors required by ia64 on a "just
because" basis, without justifying it on technical merit.

I agree with _EVERYTHING_ that Linas from IBM pointed out. What needs to
be done to get this in?

If you want to hear user testimonials, I have used KDB for the last
year. It has been invaluable in locating race conditions with spin locks
when you have large processor counts and you are stress testing a machine.
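One classic variant, sketched below with invented names, is a
lock-ordering race: two paths take the same pair of spin locks in
opposite order. Under light load it never trips; under a stress test
with a large processor count the two paths eventually meet, each CPU
spins forever on the lock the other holds, and KDB's ability to stop
every CPU and walk its backtrace points straight at the two spinners.

#include <linux/spinlock.h>

static spinlock_t lock_a = SPIN_LOCK_UNLOCKED;
static spinlock_t lock_b = SPIN_LOCK_UNLOCKED;

void path_one(void)                 /* runs on one set of CPUs     */
{
	spin_lock(&lock_a);
	spin_lock(&lock_b);         /* A -> B ordering             */
	/* ... touch the shared data ... */
	spin_unlock(&lock_b);
	spin_unlock(&lock_a);
}

void path_two(void)                 /* runs concurrently elsewhere */
{
	spin_lock(&lock_b);
	spin_lock(&lock_a);         /* B -> A ordering: the bug    */
	/* ... touch the shared data ... */
	spin_unlock(&lock_a);
	spin_unlock(&lock_b);
}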

I have complained bitterly (ask Keith Owens) that LKCD needs to be
working. One of my primary uses of LKCD since Keith got it working for
our machine has been this: I hit a problem, do a cursory check to see
whether it is something I introduced, and debug it with KDB. When I am
certain I am done using KDB, I take the dump and reanalyze it using
lcrash to ensure the answer is consistent and to do a cursory check for
other failures which may be hiding behind this bug. It also gives me a
way of determining whether my fix introduced a new bug or a later failure
came from long-running behavior which only now got bad enough. When I hit
something new, I have "perfect recall" of the previous failure.

I believe this should be in the 2.6 kernel as well. I believe it
will save time during the bringup effort for countless people.

Robin Holt


