RFD: Non-Disruptive Core Dump Infrastructure

From: Janani Venkataraman
Date: Tue Sep 03 2013 - 04:41:04 EST


Hello,

We are working on an infrastructure to create a system core file of a specific
process at run-time, non-disruptively. It can also be extended to a case where
a process is able to take a self-core dump.

gcore, an existing utility, creates a core image of the specified process. It
attaches to the process using gdb, runs the gdb gcore command and then
detaches. With gcore the dump cannot be issued from a signal handler context, as
fork() is not signal safe. Moreover it is disruptive in nature, as gdb
attaches using ptrace, which sends a SIGSTOP signal. Hence the gcore method
cannot be used if the process wants to initiate a self dump.

Previously the non-disruptive dump was tried with the Utrace approach [1].
First, all the threads would be assembled at a common place and quiesced using
UTRACE_INTERRUPT. Then the core dump would be triggered upon receiving the
event, indicating that the last thread of the process has quiesced, from its
quiesce callback. After several reviews and discussions, the Linux community
decided not to accept this proposal and has not pushed it upstream due to
various dependencies and potential risk of breaking existing implementations.
Hence the Utrace approach is not being pursued. Also Roland had mentioned that
even if the approach worked smoothly, the pause could be a significant
perturbation [2].

Another approach was using the Freezer subsystem [3]. The freezer functions in
the kernel essentially help start and stop sets of tasks, and this approach
used the existing freezer subsystem kernel interface to quiesce all the
threads of the application before triggering the core dump.
This approach was not accepted due to the potential DoS attack. Also the
community pointed out that "freeze" is a bit dangerous because an application
which is frozen cannot be ended while it is frozen, and usual user commands
such as 'ps' or 'top' give no indication that it is frozen.

So ideally what we are trying to do is to export the infrastructure using
/proc/pid/core. Reading the file would give an ELF format core dump at that
instant, non-disruptively and without killing the process.

This would involve basically three operations:

1) Holding the threads of a process without sending a signal (SIGSTOP). At this
point we can collect the register set snapshot and collect other information
required to create the ELF header. The above operation could be initiated with
the open() call.
2) Once the ELF header is created, read() can return the core dump data,
including the process memory page by page, based on the fpos (file position).
3) The threads could be released upon a close().
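
From user space, the three operations above could look roughly like the
following sketch. Note that /proc/<pid>/core is only the interface proposed in
this mail and does not exist in current kernels; core_path() and
save_core_image() are hypothetical helper names. The copy loop itself is
ordinary open()/read()/close() and works on any readable file.

```c
#include <fcntl.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/* Build the PROPOSED path. /proc/<pid>/core is an assumption taken from
 * this RFD; no current kernel provides it. Returns 0 on success. */
int core_path(char *buf, size_t len, pid_t pid)
{
    int n = snprintf(buf, len, "/proc/%d/core", (int)pid);
    return (n > 0 && (size_t)n < len) ? 0 : -1;
}

/* Copy everything readable from src into dst.
 * Returns the number of bytes copied, or -1 on error. */
long save_core_image(const char *src, const char *dst)
{
    char buf[4096];
    long total = 0;
    ssize_t n;

    int in = open(src, O_RDONLY);      /* proposal: threads are held here */
    if (in == -1)
        return -1;
    int out = open(dst, O_WRONLY | O_CREAT | O_TRUNC, 0600);
    if (out == -1) {
        close(in);
        return -1;
    }

    /* proposal: read() returns the ELF header first, then memory pages */
    while ((n = read(in, buf, sizeof buf)) > 0) {
        ssize_t off = 0;
        while (off < n) {              /* handle short writes */
            ssize_t w = write(out, buf + off, (size_t)(n - off));
            if (w <= 0) {
                n = -1;
                break;
            }
            off += w;
        }
        if (n == -1)
            break;
        total += n;
    }

    close(out);
    close(in);                         /* proposal: threads are released here */
    return n == -1 ? -1 : total;
}
```

A caller would build the path with core_path() for the target pid and pass it
as src. Under the proposal, the target's threads would stay held for the whole
duration of the copy, so the reader should close the file promptly.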

So the sub-problem here would be "How to hold these threads, collect the data
and release them non-disruptively?" in order to take a consistent dump.

As Roland had mentioned we could have a user option of having a minimal dump or
a full dump. The minimal dump can get a full register snapshot of the threads
running in user mode, and as much information as possible for those threads
that are blocked, whereas a full dump would additionally include a memory dump.
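
One way to picture the difference in the resulting file is through the ELF
header. The sketch below assumes the usual Linux core file layout, where
thread register state lives in a single PT_NOTE segment and each dumped memory
region gets its own PT_LOAD segment. The enum dump_kind and fill_core_ehdr()
names are hypothetical, and the mapping of "minimal" and "full" onto program
header counts is our assumption, not a settled design.

```c
#include <elf.h>
#include <string.h>

enum dump_kind { DUMP_MINIMAL, DUMP_FULL };   /* hypothetical user option */

/* Fill a 64-bit ELF header for a core file.
 * A minimal dump carries only the PT_NOTE segment (register snapshots);
 * a full dump adds one PT_LOAD segment per memory region.
 * Assumes a little-endian target and nr_segments small enough to fit in
 * e_phnum (the PN_XNUM extension is not handled). */
void fill_core_ehdr(Elf64_Ehdr *eh, enum dump_kind kind,
                    unsigned nr_segments, Elf64_Half machine)
{
    memset(eh, 0, sizeof *eh);
    memcpy(eh->e_ident, ELFMAG, SELFMAG);
    eh->e_ident[EI_CLASS]   = ELFCLASS64;
    eh->e_ident[EI_DATA]    = ELFDATA2LSB;
    eh->e_ident[EI_VERSION] = EV_CURRENT;

    eh->e_type      = ET_CORE;
    eh->e_machine   = machine;
    eh->e_version   = EV_CURRENT;
    eh->e_ehsize    = sizeof *eh;
    eh->e_phoff     = sizeof *eh;          /* program headers follow directly */
    eh->e_phentsize = sizeof(Elf64_Phdr);
    eh->e_phnum     = (Elf64_Half)(1 + (kind == DUMP_FULL ? nr_segments : 0));
}
```

With this layout a minimal dump stays small and quick to produce, which
matters because the threads are held while the data is collected.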

If we provide the user a way to abort the operation, say by keeping the threads
in an interruptible state, we should be able to prevent the DoS attack which was
possible with the method using the Freezer subsystem. For example, we can send
a signal to the process, which should abort the dump operation and release the
threads.

We have analyzed the following options and would like to know which one people
think is best. If there are any other mechanisms to perform the operation,
we would be happy to look at them.

1) Task work add

task_work_add() is a kernel interface for queueing work on a task. Any work
queued with it is run before the task returns to user space from the kernel, so
that work is guaranteed to be done before user space can run again.

* Exploit this function to hold the threads when they are returning to the
user space.
* Wait until all the threads of the process to be dumped have reached the
queued task work.
* Once all the threads have arrived, the dump is taken and they are released.

Disadvantages:
* A thread which is blocked in kernel space would not return to user space soon
and hence wouldn't be trapped in the queued task work.
* The dump may be delayed, as the other threads would be waiting for this
specific blocked thread to arrive.

Solution:
* A way to solve this problem is to make the other threads that are waiting,
wait for a fixed time for the blocked thread and then just create a pt_note
with zeroes to indicate the presence of the blocked thread.

2) CRIU approach

This makes use of the CRIU tool: it checkpoints the process when a dump is
requested, collects the required details and lets the process continue running.
* A self dump cannot be initiated using the command line CRIU, which is similar
to the limitation of gcore.
* A system call to do the same is being implemented, which would help us create
a self dump. The system call is not upstream yet. We could explore that option
as well.

3) PTRACE (SEIZE + INTERRUPT) via kernel thread

In this approach, a kernel thread will play the role of seizing the threads of
the process to be dumped and recording their states. We could make use of
PTRACE_SEIZE + PTRACE_INTERRUPT within the open() to stop the threads without
SIGSTOP. However, during a self dump we cannot make use of PTRACE_SEIZE, as a
self seize isn't permitted. One option is to offload this to a kernel thread
and let it capture the information. Once it is complete, the caller may be
released, so that it can continue with the dump.

* When the open() call reaches kernel space during a self dump, a kernel thread
is spawned to seize all the threads of the process, including the caller (the
thread that called open), using PTRACE_SEIZE.
* A PTRACE_INTERRUPT is issued and the required information is collected.
* On a self-dump, the kernel thread releases the caller, so that it can proceed
with the dumping.
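
For reference, the SEIZE + INTERRUPT sequence as seen from an ordinary
user-space tracer looks like the sketch below. It is not the kernel-thread
variant proposed above and cannot be used for a self dump; it only shows that
a task can be stopped and its registers read without a SIGSTOP. The function
names snapshot_regs() and demo_snapshot_child() are hypothetical, and the
register buffer is a plain byte area sized generously rather than an
architecture-specific structure.

```c
#include <elf.h>
#include <signal.h>
#include <stddef.h>
#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <sys/wait.h>
#include <unistd.h>

/* Seize and interrupt `pid` (no SIGSTOP is sent), copy its general-purpose
 * register set into buf, then detach so the task keeps running.
 * Returns the number of register bytes written, or -1 on error. */
long snapshot_regs(pid_t pid, void *buf, size_t len)
{
    struct iovec iov = { .iov_base = buf, .iov_len = len };
    int status;
    long ret = -1;

    if (ptrace(PTRACE_SEIZE, pid, NULL, NULL) == -1)
        return -1;
    if (ptrace(PTRACE_INTERRUPT, pid, NULL, NULL) == -1)
        goto out;
    /* Wait for the resulting ptrace stop before touching registers. */
    if (waitpid(pid, &status, 0) != pid || !WIFSTOPPED(status))
        goto out;
    if (ptrace(PTRACE_GETREGSET, pid, (void *)NT_PRSTATUS, &iov) == -1)
        goto out;
    ret = (long)iov.iov_len;     /* kernel updates iov_len to bytes written */
out:
    /* Detach only succeeds if the tracee is stopped; on the early error
     * paths the tracee stays attached until this tracer exits. */
    ptrace(PTRACE_DETACH, pid, NULL, NULL);
    return ret;
}

/* Demo: fork an idle child, snapshot it, then clean the child up. */
long demo_snapshot_child(void)
{
    unsigned long regs[128];     /* large enough for common 64-bit ABIs */
    pid_t child = fork();

    if (child == -1)
        return -1;
    if (child == 0)
        for (;;)
            pause();

    long n = snapshot_regs(child, regs, sizeof regs);
    kill(child, SIGKILL);
    waitpid(child, NULL, 0);
    return n;
}
```

For a multi-threaded target each thread would have to be seized and
interrupted individually by its tid, which is the job the kernel thread would
take over in the approach described above.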


APPENDIX:

[1] http://www.redhat.com/archives/utrace-devel/2009-July/msg00149.html
[2] http://www.redhat.com/archives/utrace-devel/2009-August/msg00006.html
[3] http://lwn.net/Articles/419756//

Thanking You.
With Regards,
Janani Venkataraman

