Re: [patch 0/9] kdump: Patch series for s390 support

From: Valdis . Kletnieks
Date: Sat Jul 09 2011 - 14:00:07 EST


On Thu, 07 Jul 2011 15:33:21 EDT, Vivek Goyal said:
> On Wed, Jul 06, 2011 at 11:24:47AM +0200, Michael Holzheu wrote:

> > S390 stand-alone dump tools are independent mini operating systems that
> > are installed on disks or tapes. When a dump should be created, these
> > stand-alone dump tools are booted. All that they do is to write the dump
> > (current memory plus the CPU registers) to the disk/tape device.
> >
> > The advantage compared to kdump is that since they are freshly loaded
> > into memory they can't be overwritten in memory.
>
> > Another advantage is
> > that since it is different code, it is much less likely that the dump
> > tool will run into the same problem as the previously crashed kernel.
>
> I think in practice this is not really a problem. If your kernel
> is not stable enough to even boot and copy a file, then most likely
> it has not even been deployed. The very fact that a kernel has been
> up and running verifies that it is a stable kernel for that machine
> and is capable of capturing the dump.

Vivek: I used to do VM/XA on S/390 boxes for a living, and that's *not* where
Michael is coming from.

What the standalone dump code does is take a system that may have the moral
equivalent of 256 separate PCI buses, several hundred disks all visible in
multipath configurations, and dozens of other devices, and let you capture
a dump as long as you can find *one* console and *one* tape/disk drive that
works.

More than once in my career, I got into a situation where the production system
would hang - and booting off another disk that contained an older copy with
maybe a few fewer patches would *also* hang. VM/XA would simply *not run*.
Booting the standalone dump utility (which shared zero code with VM/XA, and did
*much* less initialization of I/O devices not needed for the actual dump) would
work just fine. This would get me a dump that would show that we had a
(usually) hardware issue - either we were tripping over an erratum that *no*
released version of VM/XA had a workaround for, or outright defective hardware.

For the same efficiency reasons that Linux doesn't do a lot of checking for
"can never happen" cases, VM/XA doesn't check some things. So when busted
hardware would present logically impossible combinations of status bits (for
instance, "device still connected" but "I/O bus disconnected"), Bad Things
would happen. Booting a tiny dump program that never even *tried* to look at
the bad bits posted by the miscreant hardware would allow you to get the info
you needed to debug it.

*THAT* is the use case - when you have one customer out there in East Podunk
who is consistently managing to hang their system so hard you can't get enough
info out of it to figure out what's broken.

