Re: [PATCH v2 1/4] ftrace: Implement fs notification for preempt/irqsoff tracers

From: Viktor Rosendahl
Date: Sun May 05 2019 - 18:40:05 EST


On 5/4/19 6:47 PM, Joel Fernandes wrote:
> On Wed, May 01, 2019 at 10:36:47PM +0200, Viktor Rosendahl wrote:
>> immediately after each other in spite of the fact that the
>> preempt/irqsoff tracers operate in overwrite mode.
>>
>> Signed-off-by: Viktor Rosendahl <viktor.rosendahl@xxxxxxxxx>
>
> I agree with the general idea, but I don't really like how it is done in the
> patch.
>

Can you explain more precisely what you agree with?

The general idea of being able to trace bursts of latencies?

Or the slightly more specific idea of notifying user space when
/sys/kernel/debug/tracing/trace has new data, and let user space do the rest?

> We do have a notification mechanism already in the form of trace_pipe. Can we
> not improve that in some way to be notified of a new trace data? In theory,
> the trace_pipe does fit into the description in the documentation: "Reads
> from this file will block until new data is retrieved"
>

I am not quite sure what kind of solution you are after here. Could you be more
specific?

I think that it would be weird if we used trace_pipe to send a message with
the meaning "new data is available in the trace file". To me this would seem
like (re)inventing a special purpose alternative to inotify.

Another option would be to allow the user to consume the the latency traces
directly from trace_pipe, without using the trace file. This would make sense
to me and would indeed be a much better solution. If somebody is able to make
such an implementation I am all for it.

However, I do not know how to do this. I believe that it would lead towards a
significant rewrite of the latency tracers, probably also how the ftrace
buffering works.

My solution is clearly a hack but it has the benefits of requiring quite
small and straightforward changes to the kernel and the really ugly things are
then done in user space.

>>
>> + config PREEMPTIRQ_FSNOTIFY

> Does this have to be a CONFIG option? If prefer if the code automatically
> does the notification and it is always enabled. I don't see any drawbacks of
> that.
>

I was just thinking that sending fsnotify events to a sysfs file is a bit off
the beaten track and I figured that some people would like to opt out.

It can of course be changed so that it is always enabled when a tracer that
requires it is enabled, if there is general agreement on this point.

>> +
>> config SCHED_TRACER
>> bool "Scheduling Latency Tracer"
>> select GENERIC_TRACER
>> diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
>> index ca1ee656d6d8..ebefb8d4e072 100644
>> --- a/kernel/trace/trace.c
>> +++ b/kernel/trace/trace.c
>> @@ -44,6 +44,8 @@
>> #include <linux/trace.h>
>> #include <linux/sched/clock.h>
>> #include <linux/sched/rt.h>
>> +#include <linux/fsnotify.h>
>> +#include <linux/workqueue.h>
>>
>> #include "trace.h"
>> #include "trace_output.h"
>> @@ -8191,6 +8193,32 @@ static __init void create_trace_instances(struct dentry *d_tracer)
>> return;
>> }
>>
>> +#ifdef CONFIG_PREEMPTIRQ_FSNOTIFY
>> +
>> +static void trace_notify_workfn(struct work_struct *work)
> [snip]
>
> I prefer if this facility is available to other tracers as well such as
> the wakeup tracer which is similar in output (check
> Documentation/trace/ftrace.txt). I believe this should be a generic trace
> facility, and not tracer specific.
>

This should be possible.

If we stick with the fsnotify idea, it's possible to move the functions for it
to trace.c and to use it also from other tracers.

It is also possible to do a few small adjustments to the latency-collector and
to rename it the trace-collector, so that it can work for the wakeup case also.
The random sleep games probably don't make sense for the wakeup case but it's
already now a command line option.

Below is a suggestion for how to make the facility more generic and usable by
by both preempt/irqsoff and wakeup.

best regards,

Viktor
---
kernel/trace/Kconfig | 11 +++++++
kernel/trace/trace.c | 50 +++++++++++++++++++++++++++++--
kernel/trace/trace.h | 15 ++++++++++
kernel/trace/trace_irqsoff.c | 5 ++++
kernel/trace/trace_sched_wakeup.c | 7 ++++-
5 files changed, 85 insertions(+), 3 deletions(-)

diff --git a/kernel/trace/Kconfig b/kernel/trace/Kconfig
index 8bd1d6d001d7..ca6a63004adf 100644
--- a/kernel/trace/Kconfig
+++ b/kernel/trace/Kconfig
@@ -244,6 +244,17 @@ config SCHED_TRACER
This tracer tracks the latency of the highest priority task
to be scheduled in, starting from the point it has woken up.

+config TRACE_FSNOTIFY
+ bool "Generate fsnotify events for the trace file"
+ default n
+ depends on (IRQSOFF_TRACER || PREEMPT_TRACER || SCHED_TRACER)
+ depends on FSNOTIFY
+ help
+ This option will enable the generation of fsnotify events for the
+ trace file. This makes it possible for userspace to be notified about
+ modification of /sys/kernel/debug/tracing/trace through the inotify
+ interface.
+
config HWLAT_TRACER
bool "Tracer to detect hardware latencies (like SMIs)"
select GENERIC_TRACER
diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index ca1ee656d6d8..1640fd302e9a 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -44,6 +44,8 @@
#include <linux/trace.h>
#include <linux/sched/clock.h>
#include <linux/sched/rt.h>
+#include <linux/fsnotify.h>
+#include <linux/workqueue.h>

#include "trace.h"
#include "trace_output.h"
@@ -8191,6 +8193,50 @@ static __init void create_trace_instances(struct dentry *d_tracer)
return;
}

+#ifdef CONFIG_TRACE_FSNOTIFY
+
+static struct workqueue_struct *notify_wq;
+
+static void trace_notify_workfn(struct work_struct *work)
+{
+ struct trace_array *tr = container_of(work, struct trace_array,
+ notify_work);
+ fsnotify(tr->d_trace->d_inode, FS_MODIFY, tr->d_trace->d_inode,
+ FSNOTIFY_EVENT_INODE, NULL, 0);
+}
+
+static void trace_create_trace_file(struct trace_array *tr,
+ struct dentry *d_tracer)
+{
+ /* For notify we need to init the work structure and save the pointer */
+ INIT_WORK(&tr->notify_work, trace_notify_workfn);
+ tr->d_trace = trace_create_file("trace", 0644, d_tracer, tr,
+ &tracing_fops);
+}
+
+static __init void trace_file_notify_init(void)
+{
+ notify_wq = alloc_workqueue("irqsoff_notify_wq",
+ WQ_UNBOUND | WQ_HIGHPRI, 0);
+ if (!notify_wq)
+ pr_err("Unable to allocate irqsoff_notify_wq");
+}
+
+void trace_file_notify(struct trace_array *tr)
+{
+ if (likely(notify_wq))
+ queue_work(notify_wq, &tr->notify_work);
+}
+
+#else /* !CONFIG_TRACE_FSNOTIFY */
+
+#define trace_create_trace_file(tr, d_tracer) \
+ trace_create_file("trace", 0644, d_tracer, tr, &tracing_fops)
+
+#define trace_file_notify_init() do {} while (0)
+
+#endif /* !CONFIG_TRACE_FSNOTIFY */
+
static void
init_tracer_tracefs(struct trace_array *tr, struct dentry *d_tracer)
{
@@ -8209,8 +8255,7 @@ init_tracer_tracefs(struct trace_array *tr, struct dentry *d_tracer)
trace_create_file("trace_options", 0644, d_tracer,
tr, &tracing_iter_fops);

- trace_create_file("trace", 0644, d_tracer,
- tr, &tracing_fops);
+ trace_create_trace_file(tr, d_tracer);

trace_create_file("trace_pipe", 0444, d_tracer,
tr, &tracing_pipe_fops);
@@ -8415,6 +8460,7 @@ static __init int tracer_init_tracefs(void)
{
struct dentry *d_tracer;

+ trace_file_notify_init();
trace_access_lock_init();

d_tracer = tracing_init_dentry();
diff --git a/kernel/trace/trace.h b/kernel/trace/trace.h
index d80cee49e0eb..46f20c9bcf36 100644
--- a/kernel/trace/trace.h
+++ b/kernel/trace/trace.h
@@ -17,6 +17,7 @@
#include <linux/compiler.h>
#include <linux/trace_seq.h>
#include <linux/glob.h>
+#include <linux/workqueue.h>

#ifdef CONFIG_FTRACE_SYSCALLS
#include <asm/unistd.h> /* For NR_SYSCALLS */
@@ -302,6 +303,10 @@ struct trace_array {
struct dentry *options;
struct dentry *percpu_dir;
struct dentry *event_dir;
+#ifdef CONFIG_TRACE_FSNOTIFY
+ struct dentry *d_trace;
+ struct work_struct notify_work;
+#endif
struct trace_options *topts;
struct list_head systems;
struct list_head events;
@@ -774,6 +779,16 @@ int trace_pid_write(struct trace_pid_list *filtered_pids,
struct trace_pid_list **new_pid_list,
const char __user *ubuf, size_t cnt);

+#ifdef CONFIG_TRACE_FSNOTIFY
+
+void trace_file_notify(struct trace_array *tr);
+
+#else /* !CONFIG_TRACE_FSNOTIFY */
+
+#define trace_file_notify(tr) do {} while (0)
+
+#endif /* !CONFIG_TRACE_FSNOTIFY */
+
#ifdef CONFIG_TRACER_MAX_TRACE
void update_max_tr(struct trace_array *tr, struct task_struct *tsk, int cpu,
void *cond_data);
diff --git a/kernel/trace/trace_irqsoff.c b/kernel/trace/trace_irqsoff.c
index a745b0cee5d3..911bb635fb4e 100644
--- a/kernel/trace/trace_irqsoff.c
+++ b/kernel/trace/trace_irqsoff.c
@@ -323,6 +323,7 @@ check_critical_timing(struct trace_array *tr,
u64 T0, T1, delta;
unsigned long flags;
int pc;
+ bool notify = false;

T0 = data->preempt_timestamp;
T1 = ftrace_now(cpu);
@@ -353,6 +354,7 @@ check_critical_timing(struct trace_array *tr,
if (likely(!is_tracing_stopped())) {
tr->max_latency = delta;
update_max_tr_single(tr, current, cpu);
+ notify = true;
}

max_sequence++;
@@ -364,6 +366,9 @@ check_critical_timing(struct trace_array *tr,
data->critical_sequence = max_sequence;
data->preempt_timestamp = ftrace_now(cpu);
__trace_function(tr, CALLER_ADDR0, parent_ip, flags, pc);
+
+ if (unlikely(notify))
+ trace_file_notify(tr);
}

static nokprobe_inline void
diff --git a/kernel/trace/trace_sched_wakeup.c b/kernel/trace/trace_sched_wakeup.c
index 743b2b520d34..c87e901108b9 100644
--- a/kernel/trace/trace_sched_wakeup.c
+++ b/kernel/trace/trace_sched_wakeup.c
@@ -437,6 +437,8 @@ probe_wakeup_sched_switch(void *ignore, bool preempt,
long disabled;
int cpu;
int pc;
+ bool report = false;
+ struct trace_array *tr = wakeup_trace;

tracing_record_cmdline(prev);

@@ -481,7 +483,8 @@ probe_wakeup_sched_switch(void *ignore, bool preempt,
T1 = ftrace_now(cpu);
delta = T1-T0;

- if (!report_latency(wakeup_trace, delta))
+ report = report_latency(wakeup_trace, delta);
+ if (!report)
goto out_unlock;

if (likely(!is_tracing_stopped())) {
@@ -495,6 +498,8 @@ probe_wakeup_sched_switch(void *ignore, bool preempt,
local_irq_restore(flags);
out:
atomic_dec(&per_cpu_ptr(wakeup_trace->trace_buffer.data, cpu)->disabled);
+ if (unlikely(report))
+ trace_file_notify(tr);
}

static void __wakeup_reset(struct trace_array *tr)
--
2.17.1