Re: [PATCH v2] Add /proc/pid_gen

From: Tim Murray
Date: Wed Nov 21 2018 - 21:36:05 EST


On Wed, Nov 21, 2018 at 5:29 PM Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx> wrote:
>
> On Wed, 21 Nov 2018 17:08:08 -0800 Daniel Colascione <dancol@xxxxxxxxxx> wrote:
>
> > Have you done much
> > retrospective long trace analysis?
>
> No. Have you?
>
> Of course you have, which is why I and others are dependent upon you to
> explain why this change is worth adding to Linux. If this thing solves
> a problem which we expect will not occur for anyone between now and the
> heat death of the universe then this impacts our decisions.

I use ftrace the most on Android, so let me take a shot.

In addition to the normal "debug a slow thing" use cases for ftrace,
Android has started exploring two other ways of using ftrace:

1. "Flight recorder" mode: trigger ftrace for some amount of time when
a particular anomaly is detected to make debugging those cases easier.

2. Long traces: let a trace stream to disk for hours or days, then
postprocess it to get some deeper insights about system behavior.
We've used this very successfully to debug and optimize power
consumption.

Knowing the initial state of the system is a pain for both of these
cases. For example, one of the things I'd like to know in some of my
current use cases for long traces is the current oom_score_adj of
every process in the system, but similar to PID reuse, that can change
very quickly due to userspace behavior. There's also a race between
reading that value in userspace and writing it to trace_marker:

1. Userspace daemon X reads oom_score_adj for a process Y.
2. Process Y gets a new oom_score_adj value, triggering the
oom/oom_score_adj_update tracepoint.
3. Daemon X writes the old oom_score_adj value to trace_marker.

As I was writing this, though, I realized that the race doesn't matter
so long as our tools follow the same basic practice (for PID reuse,
oom_score_adj, or anything else we need):

1. Daemon enables all requested tracepoints and resets the trace clock.
2. Daemon enables tracing.
3. Daemon dumps initial state for any tracepoint we care about.
4. When postprocessing, a tool must consider the initial state of a
value (e.g., oom_score_adj of pid X) to be either the initial state as
reported by the daemon or the first ftrace event reporting that value.
If there is an ftrace event in the trace before the report from the
daemon, the report from the daemon should be ignored.
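
To make step 4 concrete, here is a rough sketch of the postprocessing
rule. It is not taken from any existing tool: the Record type, its
field names, and initial_state() are made up for illustration, and it
assumes the trace has already been parsed into records that share the
trace clock.

```python
from typing import Dict, Iterable, NamedTuple


class Record(NamedTuple):
    ts: float    # timestamp on the trace clock
    source: str  # "daemon" (written via trace_marker) or "ftrace" (tracepoint)
    pid: int
    value: int   # the tracked value, e.g. oom_score_adj


def initial_state(records: Iterable[Record]) -> Dict[int, int]:
    """Pick the initial value per pid: whichever record appears first."""
    initial: Dict[int, int] = {}
    for rec in sorted(records, key=lambda r: r.ts):
        # The first record for a pid wins. If an ftrace event precedes the
        # daemon's report, the (possibly stale) daemon report is ignored.
        initial.setdefault(rec.pid, rec.value)
    return initial


records = [
    Record(1.0, "ftrace", 100, 200),   # update event lands before the dump
    Record(2.0, "daemon", 100, 0),     # stale dump for pid 100, ignored
    Record(1.5, "daemon", 200, -900),  # dump arrives first for pid 200
    Record(3.0, "ftrace", 200, 100),   # later update, not initial state
]
state = initial_state(records)
```

With these made-up records, pid 100 takes its initial value from the
ftrace event and pid 200 takes it from the daemon's dump.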

The key here is that initial state as reported by userspace needs to
be provable from ftrace events. For example, if we stream ps -AT to
trace_marker from userspace, we should be able to prove that pid 5000
in that ps -AT is actually the same process that shows up as pid 5000
later on in the trace and that it has not been replaced by some other
pid 5000. That requires that any event that could break that
assumption be available from the trace itself. Accordingly, I think a
PID reuse tracepoint would work better than an atomic dump of all PIDs
because I'd rather have tracepoints for anything where the initial
state of the system matters than rely on different atomic dumps to
be sure of the initial state. (In this case, we'd have to combine a
PID reuse tracepoint with sched_process_fork and task_rename or
something like that to know what's actually running, but that's a
tractable problem.)
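
As a sketch of what "provable" could mean for the ps -AT example: a
dumped pid is trusted only if the trace contains no event creating
that pid between the start of tracing and the point of interest. The
function name and the (timestamp, child_pid) tuples are hypothetical;
they stand in for whatever a parser would extract from
sched_process_fork or a PID reuse tracepoint. This assumes tracing was
enabled before the dump was taken, so every fork since then is in the
trace.

```python
from typing import Iterable, Set, Tuple


def pids_provably_unchanged(
    dumped_pids: Set[int],
    fork_events: Iterable[Tuple[float, int]],  # (timestamp, child_pid)
    until_ts: float,
) -> Set[int]:
    """Return the dumped pids that no fork event created up to until_ts."""
    # Any pid handed out by a fork in this window may refer to a different
    # process than the one in the dump, so the dump entry can't be trusted.
    reused = {child for ts, child in fork_events if ts <= until_ts}
    return dumped_pids - reused


dumped = {1, 5000, 6000}
forks = [(2.0, 5000), (9.0, 6000)]
trusted = pids_provably_unchanged(dumped, forks, until_ts=5.0)
```

Here pid 5000 is dropped because a fork created a new pid 5000 at
t=2.0, while pid 6000 is still trusted at t=5.0 because its reuse only
happens later in the trace.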

The PID reuse tracepoint requires more intelligence in postprocessing
and it still has a race where the state of these values can be
indeterminate at the beginning of a trace if those values change
quickly, but I don't think we can get to a point where we can generate
a full snapshot of every tracepoint we care about in the system at the
start of a trace. For Android's use cases, that short race at the
beginning of a trace isn't a big deal (or at least I can't think of a
case where it would be).