[RFC][PATCH 1/2] tracing/ftrace: syscall tracing infrastructure

From: Frederic Weisbecker
Date: Fri Mar 06 2009 - 23:53:36 EST

Next message: Frederic Weisbecker: "[RFC][PATCH 2/2] tracing/x86: basic implementation of syscall tracing for x86-64"
Previous message: Frederic Weisbecker: "[RFC][PATCH 0/2] Syscalls tracing"
In reply to: Frederic Weisbecker: "[RFC][PATCH 0/2] Syscalls tracing"
Next in thread: Frederic Weisbecker: "Re: [RFC][PATCH 1/2] tracing/ftrace: syscall tracing infrastructure"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

The most important drawback that implies the syscall tracing is the variable
number of parameters that takes a syscall. Assuming there is a large number of
syscall, the goal is to provide as much as possible a generic layer on it.

What we want on tracing time (the hot path):

- Save the parameters and the number of the syscall. That's all. The rest of the
job can be done on the "output path".

On output time:

- Get the raw-binary saved parameters and the syscall number as well as some
usual infos such as the pid, the time...
- Depending on the tracing requirements, the user should have the choice to
dump the raw datas or a formatted version.

The core infrastructure has a static array which contains the necessary
informations for each syscall (ok, I've only implemented 4 for now).

When a syscall triggers, the tracer look at this array (O(1) access) and get the
number of parameters for this syscall. Then these parameters are saved in a
binary form on the ring buffer with the syscall number (a generic version, arch
independent).

On output time the parameters, the name of the syscall and very basic
informations to format the parameters are picket from the ring-buffer and the
generic syscall array.

Depending of the level of implementation we want for the syscall inside this
array, we can provide just a single mask where the n bit matches the n
parameter. If the bit is set, then the parameter will be considered as a string
pointer. Otherwise its raw hexadecimal value will be printed.

If needed, we can otherwise provide a specific callback to print the syscall
event.

The probem with these approaches comes from the fact that the string might have
disappeared from the user memory on output time. Even if we protect against
faults, it can give unreliable traces.
So perhaps we could think about copying the string on tracing time.

Concerning the return value, it doesn't make any problem, a single raw
value is printed for each one.

Note that the current implementation can be easily scaled further to give only
binary informations to the user and the matching metadata to decode it.

Signed-off-by: Frederic Weisbecker <fweisbec@xxxxxxxxx>
---
include/linux/ftrace.h | 44 +++++++
kernel/trace/Kconfig | 10 ++
kernel/trace/Makefile | 1 +
kernel/trace/trace.c | 6 +
kernel/trace/trace.h | 32 +++++
kernel/trace/trace_syscalls.c | 284 +++++++++++++++++++++++++++++++++++++++++
6 files changed, 377 insertions(+), 0 deletions(-)
create mode 100644 kernel/trace/trace_syscalls.c

diff --git a/include/linux/ftrace.h b/include/linux/ftrace.h
index e1583f2..1798692 100644
--- a/include/linux/ftrace.h
+++ b/include/linux/ftrace.h
@@ -503,4 +503,48 @@ static inline void trace_hw_branch_oops(void) {}

#endif /* CONFIG_HW_BRANCH_TRACER */

+/* Index for the ftrace syscalls array */
+enum syscall_trace_nr {
+ SYSCALL_TRACE_OPEN = 1,
+ SYSCALL_TRACE_CLOSE,
+ SYSCALL_TRACE_READ,
+ SYSCALL_TRACE_WRITE,
+
+ __SYSCALL_TRACE_END
+};
+
+/*
+ * A syscall entry in the ftrace syscalls array.
+ *
+ * @name: name of the syscall
+ * @nb_args: number of parameters it takes
+ * @simple_format: if true, we use string_mask, otherwise the print callback
+ * @string_mask: rudimentary format mask. If bit nr is set, the nr parameter
+ * will be displayed as a string, otherwise it will be considered
+ * as a raw number (hex displayed).
+ * @print: implement it if you want a custom output for a given syscall
+ */
+struct syscall_trace_entry {
+ const char *name;
+ int nb_args;
+ bool simple_format;
+ union {
+ unsigned long string_mask;
+ int (*print)(unsigned long *);
+ } output;
+};
+
+#ifdef CONFIG_FTRACE_SYSCALLS
+extern short arch_syscall_trace_nr[];
+extern void start_ftrace_syscalls(void);
+extern void stop_ftrace_syscalls(void);
+extern void ftrace_syscall_enter(struct pt_regs *regs);
+extern void ftrace_syscall_exit(struct pt_regs *regs);
+#else
+static inline void start_ftrace_syscalls(void) { }
+static inline void stop_ftrace_syscalls(void) { }
+static inline void ftrace_syscall_enter(struct pt_regs *regs) { }
+static inline void ftrace_syscall_exit(struct pt_regs *regs) { }
+#endif
+
#endif /* _LINUX_FTRACE_H */
diff --git a/kernel/trace/Kconfig b/kernel/trace/Kconfig
index 8e4a2a6..95a0ad1 100644
--- a/kernel/trace/Kconfig
+++ b/kernel/trace/Kconfig
@@ -34,6 +34,9 @@ config HAVE_FTRACE_MCOUNT_RECORD
config HAVE_HW_BRANCH_TRACER
bool

+config HAVE_FTRACE_SYSCALLS
+ bool
+
config TRACER_MAX_TRACE
bool

@@ -175,6 +178,13 @@ config EVENT_TRACER
allowing the user to pick and choose which trace point they
want to trace.

+config FTRACE_SYSCALLS
+ bool "Trace syscalls"
+ depends on HAVE_FTRACE_SYSCALLS
+ select TRACING
+ help
+ Basic tracer to catch the syscall entry and exit events.
+
config BOOT_TRACER
bool "Trace boot initcalls"
select TRACING
diff --git a/kernel/trace/Makefile b/kernel/trace/Makefile
index c7a2943..c3feea0 100644
--- a/kernel/trace/Makefile
+++ b/kernel/trace/Makefile
@@ -43,5 +43,6 @@ obj-$(CONFIG_BLK_DEV_IO_TRACE) += blktrace.o
obj-$(CONFIG_EVENT_TRACER) += trace_events.o
obj-$(CONFIG_EVENT_TRACER) += events.o
obj-$(CONFIG_EVENT_TRACER) += trace_export.o
+obj-$(CONFIG_FTRACE_SYSCALLS) += trace_syscalls.o

libftrace-y := ftrace.o
diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index cc94f86..5152be3 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -242,6 +242,12 @@ static DECLARE_WAIT_QUEUE_HEAD(trace_wait);
unsigned long trace_flags = TRACE_ITER_PRINT_PARENT | TRACE_ITER_PRINTK |
TRACE_ITER_ANNOTATE | TRACE_ITER_CONTEXT_INFO;

+
+struct trace_array *__get_global_trace(void)
+{
+ return &global_trace;
+}
+
/**
* trace_wake_up - wake up tasks waiting for trace input
*
diff --git a/kernel/trace/trace.h b/kernel/trace/trace.h
index 2bfb7d1..9583830 100644
--- a/kernel/trace/trace.h
+++ b/kernel/trace/trace.h
@@ -30,6 +30,8 @@ enum trace_type {
TRACE_GRAPH_ENT,
TRACE_USER_STACK,
TRACE_HW_BRANCHES,
+ TRACE_SYSCALL_ENTER,
+ TRACE_SYSCALL_EXIT,
TRACE_KMEM_ALLOC,
TRACE_KMEM_FREE,
TRACE_POWER,
@@ -192,6 +194,19 @@ struct kmemtrace_free_entry {
const void *ptr;
};

+struct syscall_trace_enter {
+ struct trace_entry ent;
+ enum syscall_trace_nr nr;
+ unsigned long args[];
+};
+
+struct syscall_trace_exit {
+ struct trace_entry ent;
+ enum syscall_trace_nr nr;
+ unsigned long ret;
+};
+
+
/*
* trace_flag_type is an enumeration that holds different
* states when a trace occurs. These are:
@@ -304,6 +319,10 @@ extern void __ftrace_bad_type(void);
TRACE_KMEM_ALLOC); \
IF_ASSIGN(var, ent, struct kmemtrace_free_entry, \
TRACE_KMEM_FREE); \
+ IF_ASSIGN(var, ent, struct syscall_trace_enter, \
+ TRACE_SYSCALL_ENTER); \
+ IF_ASSIGN(var, ent, struct syscall_trace_exit, \
+ TRACE_SYSCALL_EXIT); \
__ftrace_bad_type(); \
} while (0)

@@ -568,6 +587,7 @@ extern int trace_selftest_startup_branch(struct tracer *trace,
#endif /* CONFIG_FTRACE_STARTUP_TEST */

extern void *head_page(struct trace_array_cpu *data);
+extern struct trace_array *__get_global_trace(void);
extern long ns2usecs(cycle_t nsec);
extern int
trace_vprintk(unsigned long ip, int depth, const char *fmt, va_list args);
@@ -627,6 +647,18 @@ static inline int ftrace_trace_task(struct task_struct *task)
return test_tsk_trace_trace(task);
}

+#ifdef CONFIG_FTRACE_SYSCALLS
+extern enum print_line_t
+print_syscall_enter(struct trace_iterator *iter, int flags);
+extern enum print_line_t
+print_syscall_exit(struct trace_iterator *iter, int flags);
+#else
+static inline enum print_line_t
+print_syscall_enter(struct trace_iterator *iter, int flags) { }
+static inline enum print_line_t
+print_syscall_exit(struct trace_iterator *iter, int flags) { }
+#endif
+
/*
* trace_iterator_flags is an enumeration that defines bit
* positions into trace_flags that controls the output.
diff --git a/kernel/trace/trace_syscalls.c b/kernel/trace/trace_syscalls.c
new file mode 100644
index 0000000..470bb9d
--- /dev/null
+++ b/kernel/trace/trace_syscalls.c
@@ -0,0 +1,284 @@
+#include <linux/kernel.h>
+#include <linux/ftrace.h>
+#include <asm/syscall.h>
+
+#include "trace_output.h"
+#include "trace.h"
+
+/* Create basic entry on syscall array (with the rudimentary string mask) */
+#define SYS_TRACE_ENTRY(sname, args, mask) \
+ {.name = #sname, .nb_args = args, .simple_format = true, \
+ .output.string_mask = mask}
+
+/*
+ * Create a custom defined entry on the syscall array, you will have to
+ * implement a callback to output the syscall.
+ */
+#define SYS_TRACE_ENTRY_SPECIAL(sname, args, printer) \
+ {.name = #sname, .nb_args = args, .simple_format = false, \
+ .print = printer}
+
+static const struct syscall_trace_entry syscall_trace_entries[] = {
+ /* For open, the first argument is a string, hence the given mask */
+ [SYSCALL_TRACE_OPEN] = SYS_TRACE_ENTRY(open, 3, 0x1),
+ [SYSCALL_TRACE_CLOSE] = SYS_TRACE_ENTRY(close, 1, 0),
+ [SYSCALL_TRACE_READ] = SYS_TRACE_ENTRY(read, 3, 0),
+ [SYSCALL_TRACE_WRITE] = SYS_TRACE_ENTRY(read, 3, 0),
+};
+
+
+static atomic_t refcount;
+
+enum print_line_t
+print_syscall_enter(struct trace_iterator *iter, int flags)
+{
+ struct trace_seq *s = &iter->seq;
+ struct trace_entry *ent = iter->ent;
+ struct syscall_trace_enter *trace;
+ enum syscall_trace_nr syscall;
+ const struct syscall_trace_entry *entry;
+ char *str = NULL;
+ int i, ret;
+
+ trace_assign_type(trace, ent);
+
+ syscall = trace->nr;
+
+ if (!syscall || syscall >= __SYSCALL_TRACE_END)
+ return TRACE_TYPE_HANDLED;
+
+ entry = &syscall_trace_entries[syscall];
+ if (!entry->simple_format) {
+ ret = entry->output.print(trace->args);
+ if (!ret)
+ return TRACE_TYPE_PARTIAL_LINE;
+ }
+
+ ret = trace_seq_printf(s, "syscall %s :", entry->name);
+ if (!ret)
+ return TRACE_TYPE_PARTIAL_LINE;
+
+ for (i = 0; i < entry->nb_args; i++) {
+ /*
+ * FIXME: the string comes from the user, but on tracing
+ * output time, it may have disappeared from the memory.
+ * May be better to strdup it on tracing time.
+ */
+ if (entry->output.string_mask & (1 << i)) {
+ /* 128 is enough for most path */
+ str = strndup_user((char *)trace->args[i], 128);
+
+ /* If it faulted badly, the error shoud have been
+ * handled while servicing the syscall, just ignore.
+ */
+ if (str == ERR_PTR(-EFAULT) || str == ERR_PTR(-EINVAL)
+ || str == ERR_PTR(-ENOMEM))
+ return TRACE_TYPE_HANDLED;
+
+ ret = trace_seq_printf(s, " %s", str);
+ } else {
+ ret = trace_seq_printf(s, " 0x%lx", trace->args[i]);
+ }
+ if (!ret)
+ goto end_partial;
+ }
+ if (!trace_seq_printf(s, "\n"))
+ goto end_partial;
+
+ kfree(str);
+ return TRACE_TYPE_HANDLED;
+
+end_partial:
+ kfree(str);
+ return TRACE_TYPE_PARTIAL_LINE;
+}
+
+enum print_line_t
+print_syscall_exit(struct trace_iterator *iter, int flags)
+{
+ struct trace_seq *s = &iter->seq;
+ struct trace_entry *ent = iter->ent;
+ struct syscall_trace_exit *trace;
+ enum syscall_trace_nr syscall;
+ const struct syscall_trace_entry *entry;
+ int ret;
+
+ trace_assign_type(trace, ent);
+
+ syscall = trace->nr;
+
+ if (!syscall || syscall >= __SYSCALL_TRACE_END)
+ return TRACE_TYPE_HANDLED;
+
+ entry = &syscall_trace_entries[syscall];
+
+ ret = trace_seq_printf(s, "syscall %s -> 0x%lx\n", entry->name,
+ trace->ret);
+ if (!ret)
+ return TRACE_TYPE_PARTIAL_LINE;
+
+ return TRACE_TYPE_HANDLED;
+}
+
+void start_ftrace_syscalls(void)
+{
+ unsigned long flags;
+ struct task_struct *g, *t;
+
+ if (atomic_inc_return(&refcount) != 1)
+ goto out;
+
+ read_lock_irqsave(&tasklist_lock, flags);
+
+ do_each_thread(g, t) {
+ set_tsk_thread_flag(t, TIF_SYSCALL_FTRACE);
+ } while_each_thread(g, t);
+
+ read_unlock_irqrestore(&tasklist_lock, flags);
+out:
+ atomic_dec(&refcount);
+}
+
+void stop_ftrace_syscalls(void)
+{
+ unsigned long flags;
+ struct task_struct *g, *t;
+
+ if (atomic_dec_return(&refcount))
+ goto out;
+
+ read_lock_irqsave(&tasklist_lock, flags);
+
+ do_each_thread(g, t) {
+ clear_tsk_thread_flag(t, TIF_SYSCALL_FTRACE);
+ } while_each_thread(g, t);
+
+ read_unlock_irqrestore(&tasklist_lock, flags);
+out:
+ atomic_inc(&refcount);
+}
+
+void ftrace_syscall_enter(struct pt_regs *regs)
+{
+ struct trace_array *tr = __get_global_trace();
+ struct syscall_trace_enter *entry;
+ const struct syscall_trace_entry *sys_data;
+ struct ring_buffer_event *event;
+ struct trace_array_cpu *data;
+ int size;
+ int syscall_nr;
+ int nr_offset;
+ int cpu;
+
+ syscall_nr = syscall_get_nr(current, regs);
+ nr_offset = arch_syscall_trace_nr[syscall_nr];
+ if (!nr_offset || nr_offset >= __SYSCALL_TRACE_END)
+ return;
+
+ cpu = raw_smp_processor_id();
+ data = tr->data[cpu];
+
+ if (unlikely(atomic_read(&data->disabled)))
+ return;
+
+ sys_data = &syscall_trace_entries[nr_offset];
+ size = sizeof(*entry) + sizeof(unsigned long) * sys_data->nb_args;
+
+ event = trace_buffer_lock_reserve(tr, TRACE_SYSCALL_ENTER, size, 0, 0);
+ if (!event)
+ return;
+
+ entry = ring_buffer_event_data(event);
+ entry->nr = nr_offset;
+ syscall_get_arguments(current, regs, 0, sys_data->nb_args, entry->args);
+
+ ring_buffer_unlock_commit(tr->buffer, event);
+ trace_wake_up();
+}
+
+void ftrace_syscall_exit(struct pt_regs *regs)
+{
+ struct trace_array *tr = __get_global_trace();
+ struct syscall_trace_exit *entry;
+ const struct syscall_trace_entry *sys_data;
+ struct ring_buffer_event *event;
+ struct trace_array_cpu *data;
+ int syscall_nr;
+ int nr_offset;
+ int cpu;
+
+ syscall_nr = syscall_get_nr(current, regs);
+ nr_offset = arch_syscall_trace_nr[syscall_nr];
+ if (!nr_offset || nr_offset >= __SYSCALL_TRACE_END)
+ return;
+
+ cpu = raw_smp_processor_id();
+ data = tr->data[cpu];
+
+ if (unlikely(atomic_read(&data->disabled)))
+ return;
+
+ sys_data = &syscall_trace_entries[nr_offset];
+
+ event = trace_buffer_lock_reserve(tr, TRACE_SYSCALL_EXIT,
+ sizeof(*entry), 0, 0);
+ if (!event)
+ return;
+
+ entry = ring_buffer_event_data(event);
+ entry->nr = nr_offset;
+ entry->ret = syscall_get_return_value(current, regs);
+
+ ring_buffer_unlock_commit(tr->buffer, event);
+ trace_wake_up();
+}
+
+static int init_syscall_tracer(struct trace_array *tr)
+{
+ start_ftrace_syscalls();
+
+ return 0;
+}
+
+static void reset_syscall_tracer(struct trace_array *tr)
+{
+ stop_ftrace_syscalls();
+}
+
+static struct trace_event syscall_enter_event = {
+ .type = TRACE_SYSCALL_ENTER,
+ .trace = print_syscall_enter,
+};
+
+static struct trace_event syscall_exit_event = {
+ .type = TRACE_SYSCALL_EXIT,
+ .trace = print_syscall_exit,
+};
+
+static struct tracer syscall_tracer __read_mostly = {
+ .name = "syscall",
+ .init = init_syscall_tracer,
+ .reset = reset_syscall_tracer
+};
+
+__init int register_ftrace_syscalls(void)
+{
+ int ret;
+
+ ret = register_ftrace_event(&syscall_enter_event);
+ if (!ret) {
+ printk(KERN_WARNING "event %d failed to register\n",
+ syscall_enter_event.type);
+ WARN_ON_ONCE(1);
+ }
+
+ ret = register_ftrace_event(&syscall_exit_event);
+ if (!ret) {
+ printk(KERN_WARNING "event %d failed to register\n",
+ syscall_exit_event.type);
+ WARN_ON_ONCE(1);
+ }
+
+ return register_tracer(&syscall_tracer);
+}
+device_initcall(register_ftrace_syscalls);
--
1.6.1

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

Next message: Frederic Weisbecker: "[RFC][PATCH 2/2] tracing/x86: basic implementation of syscall tracing for x86-64"
Previous message: Frederic Weisbecker: "[RFC][PATCH 0/2] Syscalls tracing"
In reply to: Frederic Weisbecker: "[RFC][PATCH 0/2] Syscalls tracing"
Next in thread: Frederic Weisbecker: "Re: [RFC][PATCH 1/2] tracing/ftrace: syscall tracing infrastructure"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]