Re: [RFC][Patch] Adding kmsg_dump() to reboot/halt/poweroff/emergency_restart path

From: Aaron Durbin
Date: Wed Nov 03 2010 - 17:51:51 EST


On 10/27/10 12:44, Seiji Aguchi wrote:
Hi,

What actual problem are we solving here? Why is the current code
inadequate? It would help to demonstrate some use-case and to explain
how the situation improved with this patch.

[Purpose]
My purpose is to develop a highly reliable logging facility for
enterprise use.

I'm planning to add the following triggers for kmsg_dump().
- reboot/poweroff/halt/emergency_restart (this patch)
- Machine check

I'm also planning to add a feature that outputs kernel messages to
NVRAM,
because enterprise servers are equipped with NVRAM.
We can realize a highly reliable logging facility by outputting
kernel messages to NVRAM.
(NVRAM is commonly used on Mainframe and Commercial Unix as well.)

[Use case of reboot/poweroff/halt/emergency_restart]

My company has often experienced the following in our support service.
- A customer's system suddenly reboots.
- The customer asks us to investigate the reason for the reboot.

We can tell that a reboot happened because boot messages remain in
/var/log/messages.
However, we can't investigate why the system rebooted,
because the last messages before the reboot don't remain.
And of course we can't explain the reason.


We can solve the above problem with this patch as follows.
Case1: reboot with command
- We can see "Restarting system with command:" or "Restarting
system.".

Case2: halt with command
- We can see "System halted.".

Case3: poweroff with command
- We can see "Power down.".

Case4: emergency_restart with sysrq.
- We can see "Sysrq:" printed in __handle_sysrq().

Case5: emergency_restart with softdog.
- We can see "Initiating system reboot" in watchdog_fire().

So, we can distinguish the reason for a reboot, poweroff, halt or
emergency_restart.

If a customer ran the reboot command, you may think the customer
should know that.
However, customers who rebooted the system by mistake often claim
they did not run the command.

No evidential message remains in the current Linux kernel, so we can't
show proof to the customer.
This patch improves this situation.

Seiji

We carry patches in our kernels that do very similar things. The reason is essentially the same as what you have cited. On our platforms we have two different ways of storing events to an event log. One communicates with the BIOS itself; the other writes bit flags to a known area of non-volatile storage. That way, when the machine comes back up, we have a clear eventlog (with times) of what happened and when. Piecing these events together has proven invaluable for finding issues.

Both of the drivers that log these events use a shared interface that collects various events in the kernel and presents them through a single notifier chain for the drivers' consumption.

The things we currently track and log are the following:
- clean reboot/shutdown
- panic
- oops
- die
- NMI watchdog
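
To illustrate the shape of that shared interface, here is a minimal userspace sketch, not the actual patches and not the kernel's notifier API. Every name in it (shutdown_event, event_notifier, nv_flags and so on) is hypothetical. It assumes one flag bit per event type and uses a plain variable as a stand-in for the known area of non-volatile storage.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical event types, mirroring the list above. */
enum shutdown_event {
	EV_CLEAN = 0,
	EV_PANIC,
	EV_OOPS,
	EV_DIE,
	EV_NMI_WATCHDOG,
};

/* One subscriber (driver) on the chain. */
struct event_notifier {
	void (*notify)(struct event_notifier *self, enum shutdown_event ev);
	struct event_notifier *next;
};

static struct event_notifier *chain_head;

/* Add a driver to the front of the chain. */
static void event_notifier_register(struct event_notifier *n)
{
	n->next = chain_head;
	chain_head = n;
}

/* Deliver one event to every registered driver. */
static void event_notify_all(enum shutdown_event ev)
{
	struct event_notifier *n;

	for (n = chain_head; n != NULL; n = n->next)
		n->notify(n, ev);
}

/* Assumption: stand-in for the flag word kept in non-volatile storage. */
static uint32_t nv_flags;

/* Driver callback: record the event as a single bit flag. */
static void nv_flag_notify(struct event_notifier *self, enum shutdown_event ev)
{
	(void)self;
	nv_flags |= UINT32_C(1) << ev;
}

static struct event_notifier nv_flag_driver = { .notify = nv_flag_notify };
```

A second driver (for example one that talks to the BIOS) would register its own event_notifier on the same chain, so the code that detects an oops or panic only has to call event_notify_all() once.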

An example eventlog produced by our systems looks like the following (63-67 are the boot numbers of the system in question):

2010-10-14 10:26:06 | System Reset | 63
2010-10-14 10:26:19 | System boot | 63
2010-10-14 11:36:43 | Kernel Shutdown | 63 | Unknown Shutdown Reason
2010-10-14 11:36:43 | System Reset | 64
2010-10-14 11:36:56 | System boot | 64
2010-10-18 14:51:54 | Kernel Shutdown | 64 | Clean
2010-10-18 14:52:38 | System Reset | 65
2010-10-18 14:52:51 | System boot | 65
2010-10-26 02:44:48 | Kernel Shutdown | 65 | Oops
2010-10-26 02:44:48 | Kernel Shutdown | 65 | Die
2010-10-26 02:44:49 | Kernel Shutdown | 65 | Panic
2010-10-26 02:45:43 | System Reset | 66
2010-10-26 02:45:56 | System boot | 66
2010-10-26 02:49:22 | Kernel Shutdown | 66 | Clean
2010-10-26 02:50:05 | System Reset | 67
2010-10-26 02:50:18 | System boot | 67
2010-10-26 11:39:20 | Kernel Shutdown | 67 | Clean
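
For anyone who wants to post-process a log in this format, the following is a rough sketch of parsing one line into its fields. It assumes the " | " separated layout shown above, with an optional fourth field for the shutdown reason. The struct and function names are made up for illustration and are not part of our tooling.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical in-memory form of one eventlog line. */
struct eventlog_entry {
	char timestamp[20];	/* "YYYY-MM-DD HH:MM:SS" */
	char event[32];		/* e.g. "Kernel Shutdown" */
	int boot;		/* boot number */
	char reason[32];	/* empty when the line has no reason field */
};

/* Strip trailing spaces and line endings in place. */
static void rtrim(char *s)
{
	size_t len = strlen(s);

	while (len > 0 && (s[len - 1] == ' ' || s[len - 1] == '\n' ||
			   s[len - 1] == '\r'))
		s[--len] = '\0';
}

/* Parse one line; returns 0 on success, -1 if the line is malformed. */
static int eventlog_parse(const char *line, struct eventlog_entry *e)
{
	int n;

	memset(e, 0, sizeof(*e));
	n = sscanf(line, "%19[^|] | %31[^|]| %d | %31[^\n]",
		   e->timestamp, e->event, &e->boot, e->reason);
	if (n < 3)
		return -1;	/* timestamp, event and boot number are required */

	rtrim(e->timestamp);
	rtrim(e->event);
	rtrim(e->reason);
	return 0;
}
```

Grouping parsed entries by boot number then makes a sequence like boot 65 above (Oops, Die, Panic within one second) easy to pick out.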

Hope that helps others know that we think such a mechanism is vital. I can post the patches for the common infrastructure if people are interested.

-Aaron