Re: [RFC v5][PATCH 6/8] Checkpoint/restart: initial documentation

From: MinChan Kim
Date: Wed Sep 17 2008 - 02:23:38 EST


On Sun, Sep 14, 2008 at 8:06 AM, Oren Laadan <orenl@xxxxxxxxxxxxxxx> wrote:
> Covers application checkpoint/restart, overall design, interfaces
> and checkpoint image format.
>
> Signed-off-by: Oren Laadan <orenl@xxxxxxxxxxxxxxx>
> ---
> Documentation/checkpoint.txt | 207 ++++++++++++++++++++++++++++++++++++++++++
> 1 files changed, 207 insertions(+), 0 deletions(-)
> create mode 100644 Documentation/checkpoint.txt
>
> diff --git a/Documentation/checkpoint.txt b/Documentation/checkpoint.txt
> new file mode 100644
> index 0000000..6bf75ce
> --- /dev/null
> +++ b/Documentation/checkpoint.txt
> @@ -0,0 +1,207 @@
> +
> + === Checkpoint-Restart support in the Linux kernel ===
> +
> +Copyright (C) 2008 Oren Laadan
> +
> +Author: Oren Laadan <orenl@xxxxxxxxxxxxxxx>
> +
> +License: The GNU Free Documentation License, Version 1.2
> + (dual licensed under the GPL v2)
> +Reviewers:
> +
> +Application checkpoint/restart [CR] is the ability to save the state
> +of a running application so that it can later resume its execution
> +from the time at which it was checkpointed. An application can be
> +migrated by checkpointing it on one machine and restarting it on
> +another. CR can provide many potential benefits:
> +
> +* Failure recovery: by rolling back an to a previous checkpoint
> +
> +* Improved response time: by restarting applications from checkpoints
> + instead of from scratch.
> +
> +* Improved system utilization: by suspending long running CPU
> + intensive jobs and resuming them when load decreases.
> +
> +* Fault resilience: by migrating applications off of faulty hosts.
> +
> +* Dynamic load balancing: by migrating applications to less loaded
> + hosts.
> +
> +* Improved service availability and administration: by migrating
> + applications before host maintenance so that they continue to run
> + with minimal downtime
> +
> +* Time-travel: by taking periodic checkpoints and restarting from
> + any previous checkpoint.
> +
> +
> +=== Overall design
> +
> +Checkpoint and restart is done in the kernel as much as possible. The
> +kernel exports a relative opaque 'blob' of data to userspace which can
> +then be handed to the new kernel at restore time. The 'blob' contains
> +data and state of select portions of kernel structures such as VMAs
> +and mm_structs, as well as copies of the actual memory that the tasks
> +use. Any changes in this blob's format between kernel revisions can be
> +handled by an in-userspace conversion program. The approach is similar
> +to virtually all of the commercial CR products out there, as well as
> +the research project Zap.
> +
> +Two new system calls are introduced to provide CR: sys_checkpoint and
> +sys_restart. The checkpoint code basically serializes internal kernel
> +state and writes it out to a file descriptor, and the resulting image
> +is stream-able. More specifically, it consists of 5 steps:
> + 1. Pre-dump
> + 2. Freeze the container
> + 3. Dump
> + 4. Thaw (or kill) the container
> + 5. Post-dump
> +Steps 1 and 5 are an optimization to reduce application downtime:
> +"pre-dump" works before freezing the container, e.g. the pre-copy for
> +live migration, and "post-dump" works after the container resumes
> +execution, e.g. write-back the data to secondary storage.
> +
> +The restart code basically reads the saved kernel state and from a
> +file descriptor, and re-creates the tasks and the resources they need
> +to resume execution. The restart code is executed by each task that
> +is restored in a new container to reconstruct its own state.
> +
> +
> +=== Interfaces
> +
> +int sys_checkpoint(pid_t pid, int fd, unsigned long flag);
> + Checkpoint a container whose init task is identified by pid, to the
> + file designated by fd. Flags will have future meaning (should be 0
> + for now).
> + Returns: a positive integer that identifies the checkpoint image
> + (for future reference in case it is kept in memory) upon success,
> + 0 if it returns from a restart, and -1 if an error occurs.
> +
> +int sys_restart(int crid, int fd, unsigned long flags);
> + Restart a container from a checkpoint image identified by crid, or
> + from the blob stored in the file designated by fd. Flags will have
> + future meaning (should be 0 for now).
> + Returns: 0 on success and -1 if an error occurs.
> +
> +Thus, if checkpoint is initiated by a process in the container, one
> +can use logic similar to fork():
> + ...
> + crid = checkpoint(...);
> + switch (crid) {
> + case -1:
> + perror("checkpoint failed");
> + break;
> + default:
> + fprintf(stderr, "checkpoint succeeded, CRID=%d\n", ret);
> + /* proceed with execution after checkpoint */
> + ...
> + break;
> + case 0:
> + fprintf(stderr, "returned after restart\n");
> + /* proceed with action required following a restart */
> + ...
> + break;
> + }
> + ...
> +And to initiate a restart, the process in an empty container can use
> +logic similar to execve():
> + ...
> + if (restart(crid, ...) < 0)
> + perror("restart failed");
> + /* only get here if restart failed */
> + ...
> +
> +
> +=== Checkpoint image format
> +
> +The checkpoint image format is composed of records consistings of a
> +pre-header that identifies its contents, followed by a payload. (The
> +idea here is to enable parallel checkpointing in the future in which
> +multiple threads interleave data from multiple processes into a single
> +stream).
> +
> +The pre-header is defined by "struct cr_hdr" as follows:
> +
> +struct cr_hdr {
> + __s16 type;
> + __s16 len;
> + __u32 id;
> +};
> +
> +Here, 'type' field identifies the type of the payload, 'len' tells its
> +length in bytes. The 'id' identifies the owner object instance. The

which is right between id and parent?
It is confusing me :)

> +meaning of the 'id' field varies depending on the type. For example,
> +for type CR_HDR_MM, the 'id' identifies the task to which this MM
> +belongs. The payload also varies depending on the type, for instance,
> +the data describing a task_struct is given by a 'struct cr_hdr_task'
> +(type CR_HDR_TASK) and so on.
> +
> +The format of the memory dump is as follows: for each VMA, there is a
> +'struct cr_vma'; if the VMA is file-mapped, it is followed by the file
> +name. Following comes the actual contents, in one or more chunk: each
> +chunk begins with a header that specifies how many pages it holds,
> +then a the virtual addresses of all the dumped pages in that chunk,
> +followed by the actual contents of all the dumped pages. A header with
> +zero number of pages marks the end of the contents for a particular
> +VMA. Then comes the next VMA and so on.
> +
> +To illustrate this, consider a single simple task with two VMAs: one
> +is file mapped with two dumped pages, and the other is anonymous with
> +three dumped pages. The checkpoint image will look like this:
> +
> +cr_hdr + cr_hdr_head
> +cr_hdr + cr_hdr_task
> + cr_hdr + cr_hdr_mm
> + cr_hdr + cr_hdr_vma + cr_hdr + string
> + cr_hdr_pgarr (nr_pages = 2)
> + addr1, addr2
> + page1, page2
> + cr_hdr_pgarr (nr_pages = 0)
> + cr_hdr + cr_hdr_vma
> + cr_hdr_pgarr (nr_pages = 3)
> + addr3, addr4, addr5
> + page3, page4, page5
> + cr_hdr_pgarr (nr_pages = 0)
> + cr_hdr + cr_mm_context
> + cr_hdr + cr_hdr_thread
> + cr_hdr + cr_hdr_cpu
> +cr_hdr + cr_hdr_tail
> +
> +
> +=== Changelog
> +
> +[2008-Sep-11] v5:
> + - Config is 'def_bool n' by default
> + - Improve memory dump/restore code (following Dave Hansen's comments)
> + - Change dump format (and code) to allow chunks of <vaddrs, pages>
> + instead of one long list of each
> + - Fix use of follow_page() to avoid faulting in non-present pages
> + - Memory restore now maps user pages explicitly to copy data into them,
> + instead of reading directly to user space; got rid of mprotect_fixup()
> + - Remove preempt_disable() when restoring debug registers
> + - Rename headers files s/ckpt/checkpoint/
> + - Fix misc bugs in files dump/restore
> + - Fix cleanup on some error paths
> + - Fix misc coding style
> +
> +[2008-Sep-04] v4:
> + - Fix calculation of hash table size
> + - Fix header structure alignment
> + - Use stand list_... for cr_pgarr
> +
> +[2008-Aug-20] v3:
> + - Various fixes and clean-ups
> + - Use standard hlist_... for hash table
> + - Better use of standard kmalloc/kfree
> +
> +[2008-Aug-09] v2:
> + - Added utsname->{release,version,machine} to checkpoint header
> + - Pad header structures to 64 bits to ensure compatibility
> + - Address comments from LKML and linux-containers mailing list
> +
> +[2008-Jul-29] v1:
> +In this incarnation, CR only works on single task. The address space
> +may consist of only private, simple VMAs - anonymous or file-mapped.
> +Both checkpoint and restart will ignore the first argument (pid/crid)
> +and instead act on themselves.
> --
> 1.5.4.3
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@xxxxxxxxxxxxxxx
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/
>



--
Kinds regards,
MinChan Kim
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/