Re: Linux kernel and disaster recovery.

Peter Benie (pjb1008@cam.ac.uk)
Wed, 18 Jun 1997 13:45:35 +0100


TIGRANA@dstiuk.ccmail.compuserve.com writes ("Linux kernel and disaster recovery."):
> What if the Linux server itself crashes? If it is under UPS and
> there was some clever kernel module that would be able to somehow
> save the state of all (or specific) running processes and write to a
> separate disk partition and then after reboot to be able to restore
> the "memory dump" from the partition into memory thus revitalising
> all those running processes that would be very nice.

The term you are looking for is "checkpointing". You take a snapshot
of a process every so often and when the machine is rebooted after a
crash, you can restore the process to the state of the snapshot.
Alternatively, you can stop a process and start it again later
(perhaps on a different machine).
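
For a single program, the idea can be sketched roughly as follows.
This is only an illustration in Python of application-level
checkpointing, not how a kernel-level checkpointer works. The names
save_checkpoint, load_checkpoint and run are hypothetical, and the
"process state" is assumed to be a small dictionary that the program
chooses to save.

```python
import os
import pickle
import tempfile


def save_checkpoint(state, path):
    """Write a snapshot of `state` to `path` without ever leaving a torn file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory, prefix=".ckpt-")
    try:
        with os.fdopen(fd, "wb") as f:
            pickle.dump(state, f)
            f.flush()
            os.fsync(f.fileno())  # push the data to disk before renaming
        os.replace(tmp, path)     # atomic rename: old or new snapshot, never half
    except BaseException:
        if os.path.exists(tmp):
            os.unlink(tmp)
        raise


def load_checkpoint(path, default):
    """Return the last snapshot, or `default` if none has been written yet."""
    try:
        with open(path, "rb") as f:
            return pickle.load(f)
    except FileNotFoundError:
        return default


def run(path, total_steps, checkpoint_every=10, crash_at=None):
    """Toy workload: sum the step numbers, snapshotting every few steps.

    `crash_at` is a hypothetical knob that simulates the machine dying.
    """
    state = load_checkpoint(path, {"step": 0, "total": 0})
    while state["step"] < total_steps:
        if crash_at is not None and state["step"] == crash_at:
            raise RuntimeError("simulated crash")
        state["total"] += state["step"]
        state["step"] += 1
        if state["step"] % checkpoint_every == 0:
            save_checkpoint(state, path)
    save_checkpoint(state, path)
    return state
```

After a crash, calling run() again with the same path resumes from the
last snapshot; any work done since that snapshot is simply repeated.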

> Of course, I understand that the network sockets will be lost but it
> is fine because with the scheme described above one simply
> reattaches to the sessions using the UNIX domain socket and resumes
> it.

There are other issues involved, such as restoring pgids, ttys,
processes in pipes, etc. If you want to implement checkpointing
properly, it's a non-trivial amount of work. If you want to do this
just for your one server process, then you should just write your
server so that it can cope with being restarted.
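
To give a feel for the state that lives outside a process's own
memory, here is a rough Python sketch that lists some of it for the
current process. It is an illustration only: describe_fd and
kernel_state are hypothetical names, scanning descriptors 0 to
max_fd-1 is an assumption made for simplicity, and a real
checkpointer would have to record far more (file offsets, signal
state, the other end of each pipe, and so on).

```python
import os
import stat


def describe_fd(fd):
    """Classify an open descriptor by the kind of kernel object behind it."""
    mode = os.fstat(fd).st_mode  # raises OSError if fd is not open
    if stat.S_ISFIFO(mode):
        return "pipe"
    if stat.S_ISSOCK(mode):
        return "socket"
    if stat.S_ISCHR(mode):
        return "tty" if os.isatty(fd) else "chardev"
    if stat.S_ISREG(mode):
        return "file"
    if stat.S_ISDIR(mode):
        return "dir"
    return "other"


def kernel_state(max_fd=64):
    """Collect process identifiers and open descriptors (Unix only)."""
    fds = {}
    for fd in range(max_fd):
        try:
            fds[fd] = describe_fd(fd)
        except OSError:
            continue  # descriptor not open
    return {
        "pid": os.getpid(),
        "pgid": os.getpgrp(),   # process group, used for job control
        "sid": os.getsid(0),    # session, which ties the process to a tty
        "fds": fds,
    }
```

None of these values appear in a plain memory dump, which is why a
dump alone is not enough to bring a process back.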

> Any ideas as to whether it can be implemented at all? I would be
> interested in both global solutions i.e. the whole kernel, perhaps
> even with all the filesystems

Restoring the whole kernel is difficult and has little advantage. You
want processes not to be able to tell that they've been saved and
restored. This doesn't require having exactly the same kernel - the
kernel contains all kinds of data that the process doesn't get to see
and which cannot influence the process and that data doesn't need to
be restored. Furthermore, the environment of the kernel will change -
all the hardware will have been reset, and you may even have different
hardware. Restoring the kernel's state from when the hardware was
running normally will not work.

> and local i.e. only specified processes
> are saved and restored, together with the kernel objects that they
> have allocated (IPC, descriptors, ttys etc).

This is a much more achievable goal. (SGIs can do it.)

Peter