Re: The case for a standard kernel debugger

From: richardj_moore@uk.ibm.com
Date: Fri Sep 15 2000 - 04:42:54 EST


I think the case for the kernel debugger is better stated as the case for
RAS (Reliability, Serviceability and Availability) in the kernel, in other
words, there is a case for having the right diagnostic, reporting and
recovery tools in the right place at the right time. A kdb does not fulfil
all diagnostic RAS needs. IMHO it's an extremely powerful developement
tool, but hang on so is a logic analyser and a source-level debugger. It
can also be a real pain if trying to debug HLL source using an assembler
based debugger. The point is, one generally needs a debugger that matches
the semantics that the programmer is dealing with. If its, assembler code
the so be it, use a kdb. If you're poking around with H/W specific
interfaces and system busses the you make need a lower level tool.

But should a kdb be a standard part of the kernel for use in
production/commercial/enterprise environments? I don't believe so. Looking
back at the techniques we've deployed over the years to debug system
problems in commercial environments, we only ever had the luxury of using a
kdb with OS/2. Just about every other OS we supported did not have a kdb.
The OS/2 case is interesting because initially, we had only the kdb for
debugging, and it was the worst platform for serviceability that we ever
supported. We couldn't debug those typically obscure problems that occurred
only in production environments and which could never be readily
re-created in the lab. We took an enormous amount of pain over this from
our customers over poor serviceability. They hated every minute of
production time we took from them when a developer took control of their
systems in order to debug, or in many cases not debug the problem. Of
course we had created a rod for our own backs. Customers knew we never sent
developers on site to debug OS/390 (or MVS as it was called in those days).
They also knew that we never rejected a problem because it was not
re-creatable and we never even asked for a re-creation scenario. The reason
for this was that we had appropriate RAS capability in MVS which allowed
data to be captured automatically at fault time combined with a certain
amount of self-healing capability and automated recovery. What we did to
OS/2 to make it approach this level of RAS capability was to implement a
system dump capability - similar to SGI's kernel crash dump, + a
comprehensive system tracing facility that could be dynamically customised
to tracing events in any code path without any code recompilation - IBM's
Dynamic Probes for Linux is an initial port of the capability + a
comprehensive and customisable virtual storage based dump, a bit like core
dump, except that it could dump process trees if required and memory from
not only from user space, but from system space based upon kernel
sub-component, for example file-system structures etc..

That capability completely transformed our ability to debug serious and
obscure problems, with minimal disruption. It's true that we weren't
immediately successful when we implemented this stuff. There's a major
learning curve and mind-set change required to work with captured data as
opposed to interactive debugging.

We didn't throw away the kdb, it's still very useful:

1) as a didactic tool.
2) for the final stages of problem determination - every problem is
re-creatable once you know the triggers. And when you do, which you can get
from dumps and traces, then you can set up a lab-based experiment where you
use a debugger to solve the final mystery.
3) in production for those exceedingly rare cases where we needed to know
what the underlying hardware was up to - it's a cheaper option than using a
logic analyser.

One big argument against RAS of any sort is that it bloats the kernel and
not every one wants it (until they have a problem). A further argument with
Linux is that you may have to do quite a bit of hard work to get the subset
of RAS you need to co-exist, if it exists at all. Something we're working
on which may help resolve this, and will be made available with the next
drop of Dynamic Probes is Generalised Kernel Hooks Interface (GKHI). The
idea here is to make all our RAS function the option of being dynamically
loadable kernel modules. In most cases we don't need to modify kernel
function, just get control at the right time. So we place hooks in kernel
source, which remain dormant until activated by the GKHI when a RAS module
asks it to. Maybe this will provide a way out of the difficulty.

Richard Moore - RAS Project Lead - Linux Technology Centre.

http://oss.software.ibm.com/developerworks/opensource/linux
Office: (+44) (0)1962-817072, Mobile: (+44) (0)7768-298183
PISC, MP135 Galileo Centre, Hursley Park, Winchester, SO21 2JN, UK

Keith Owens <kaos@ocs.com.au> on 13-09-2000 10:49:50 PM

Please respond to Keith Owens <kaos@ocs.com.au>

To: linux-kernel@vger.kernel.org
cc: torvalds@transmeta.com (bcc: Richard J Moore/UK/IBM)
Subject: The case for a standard kernel debugger

Resend, this time with cc: torvalds.

This note puts the case for including a kernel debugger in the master
tarballs. These points do not only apply to kdb, they apply to any
kernel debugger. Comments about the perceived deficiencies of kdb,
kgdb, xmon or any other debugger are not relevant here, nor are
questions about how or when debuggers should be activated. I want to
concentrate of whether the kernel should have *any* standard debugger
at all.

If Linus still says "no" to including any debugger in the master
tarball then that will be the end of this thread as far as I am
concerned. I will then talk to distributors about getting a debugger
included in their kernels as a patch. Hopefully the distributors who
want a kernel debugger can then agree on a standard one.

Disclaimer: Part of my paying job is to maintain kdb. SGI want kdb to
            be used more widely to benefit from GPL support. More eyes
            and hands means better code for everybody.

(1) Random kernel errors are easier to document and report with a
    debugger. Oops alone is not always enough to find a problem,
    sometimes you need to look at parameters and control blocks. This
    is particularly true for hardware problems.

(2) Support of Linux in commercial environments will benefit from a
    standard kernel debugger. The last thing we want is each
    commercial support contract including a different debugger and
    supplying different bug reports. Bug reports on supported systems
    should go to the support contractor but some will filter through to
    the main linux lists.

(3) Architecture consistency. Sparc, mips, mips64, ppc, m68k, superh,
    s390 already have remote debugger support in the standard kernel.
    i386, alpha, sparc64, arm, ia64 do not have standard debuggers,
    they have to apply extra patches. Some architectures have extra
    debugger code in addition to the remote gdb support.

(4) Debugger consistency. Back in 1997 there were a lot of individual
    kernel debugging patches going around for memory leaks, stack
    overflow, lockups etc. These patches conflicted with each other so
    they were difficult for people to use. I built the original set of
    Integrated Kernel Debugging (IKD) patches because I contend that
    having a standard debugging patch instead of lots of separate ones
    is far easier for everybody to use. The same is true of a kernel
    debugger, having one standard debugger that all kernels use will
    improve the productivity of everyone who has to support kernel
    code, no need to learn the semantics of multiple separate
    debuggers.

(5) Easier for kernel beginners to learn the kernel internals. Having
    worked on 10+ operating systems over the years, I can testify that
    some form of kernel/OS tracing facility is extremely useful to get
    people started. I agree with Linus when he said

       "'Use the Source, Luke, use the Source. Be one with the code.'.
       Think of Luke Skywalker discarding the automatic firing system
       when closing on the deathstar, and firing the proton torpedo (or
       whatever) manually. _Then_ do you have the right mindset for
       fixing kernel bugs."

    But Linus has also said "The main trick is having 5 years of
    experience with those pesky oops messages ;-)". Beginners need
    some way of getting that experience. Reading the source from a
    cold start is an horrendous learning curve, debuggers help to see
    what the source is really doing. Always remember that 90%+ of
    kernel users are beginners, anything that helps to convert somebody
    from kernel beginner to kernel expert cannot be bad.

(6) I contend that kernel debuggers result in better patches, most of
    the time. Sometimes people misuse a debugger, as Linus said

       "I'm afraid that I've seen too many people fix bugs by looking
       at debugger output, and that almost inevitably leads to fixing
       the symptoms rather than the underlying problems."

    Does that occur? Of course it does, I have been guilty of that
    myself over the years. Is it inevitable? IMNSHO, no. Seven of
    the twelve architectures in the standard kernel already have built
    in debuggers. Where is the evidence that these architectures have
    more bad patches because of the presence of the debuggers?

    Even if somebody does submit a patch to fix the symptom instead of
    the problem, that alone is valuable information. Fixing the
    symptom focuses attention and the associated information helps to
    fix the real problem. We get problem patches even without
    debuggers (let's not mention the recent truncate problems ;) but
    there are enough eyes on the kernel to find problem patches and
    remove them. Adding a standard debugger will improve the quality
    of some of those eyes at a faster rate.

So how about it Linus? Does any of this change your mind about
including a standard kernel debugger?

Asbestos_underware_mode=ON.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/



This archive was generated by hypermail 2b29 : Fri Sep 15 2000 - 21:00:25 EST