Re: [PATCH] perf record: handle death by SIGTERM

From: David Ahern
Date: Wed May 08 2013 - 09:48:49 EST


On 5/8/13 12:54 AM, Ingo Molnar wrote:

* David Ahern <dsahern@xxxxxxxxx> wrote:

On 5/7/13 12:29 AM, Ingo Molnar wrote:

* Stephane Eranian <eranian@xxxxxxxxxx> wrote:

This is a good fix. I have run into this infinite loop in perf report
many times.

Hm, perf report should really not assume much about the perf.data and
should avoid infinite loops ...

So while making perf.data more consistent on SIGTERM is a nice fix, perf
report should be fixed as well to detect loops and such.

Thanks,

Ingo


This seems to do the trick:

diff --git a/tools/perf/util/header.c b/tools/perf/util/header.c
index 326068a..e82646f 100644
--- a/tools/perf/util/header.c
+++ b/tools/perf/util/header.c
@@ -2802,6 +2802,17 @@ int perf_session__read_header(struct perf_session *session, int fd)
 	if (perf_file_header__read(&f_header, header, fd) < 0)
 		return -EINVAL;

+	/*
+	 * sanity check that perf.data was written cleanly: data size
+	 * is initialized to 0 and updated only if the on_exit function
+	 * is run. If data size is still 0 then the file cannot be
+	 * processed.
+	 */
+	if (f_header.data.size == 0) {
+		pr_err("data size is 0. Was record properly terminated?\n");
+		return -1;
+	}

Hm, this detects the condition - but where does the looping come from?

Can it happen with a perf.data that 'seems' clean but is corrupted
(because it was not fully written, a buggy kernel just crashed, etc.)?

In essence it would be _very_ nice if someone reproduced the looping and
checked what to do to fix the looping itself. Or does the above
data.size == 0 check fully fix the looping under every possible state of a
perf.data?

I think so. If you want the dirty details here you go.

The looping is in __perf_session__process_events. When the data file is not closed properly, data_size is 0 and, in my case, data_offset is 288. Dropping into this function:

page_offset = page_size * (data_offset / page_size);
file_offset = page_offset;
head = data_offset - page_offset;

which means

page_offset = 0
file_offset = 0
head = 288
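
For anyone who wants to check the arithmetic outside of perf, here is a small standalone sketch. It is not the perf code: struct offsets and split_offset() are names made up for illustration, and the page size is passed in as a parameter (4096 below is only an assumption for the example).

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical container for the three values computed at the top of
 * __perf_session__process_events(); this is not a perf type.
 */
struct offsets {
	uint64_t page_offset;
	uint64_t file_offset;
	uint64_t head;
};

/* Round data_offset down to a page boundary; the remainder goes in head. */
static struct offsets split_offset(uint64_t data_offset, uint64_t page_size)
{
	struct offsets o;

	assert(page_size != 0);
	o.page_offset = page_size * (data_offset / page_size);
	o.file_offset = o.page_offset;
	o.head = data_offset - o.page_offset;
	return o;
}
```

With data_offset = 288 and an assumed page size of 4096, split_offset(288, 4096) yields page_offset = 0, file_offset = 0 and head = 288, matching the values listed above.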

The looping is here:

remap:
	buf = mmap(NULL, mmap_size, mmap_prot, mmap_flags, session->fd,
		   file_offset);
	if (buf == MAP_FAILED) {
		pr_err("failed to mmap file\n");
		err = -errno;
		goto out_err;
	}
	mmaps[map_idx] = buf;
	map_idx = (map_idx + 1) & (ARRAY_SIZE(mmaps) - 1);
	file_pos = file_offset + head;

more:
	event = fetch_mmaped_event(session, head, mmap_size, buf);

--> returned event is NULL

	if (!event) {
		if (mmaps[map_idx]) {
			munmap(mmaps[map_idx], mmap_size);
			mmaps[map_idx] = NULL;
		}

		page_offset = page_size * (head / page_size);
		file_offset += page_offset;
		head -= page_offset;

--> head is 288, which means the new page_offset is 0 and file_offset stays at 0. head never changes, and then we go back to remap.

		goto remap;
	}

So, if you want to handle the looping, then checking whether the new page_offset above is 0 would suffice. A 0 value means file_offset does not change, and the jump to remap means the mmap does not change, i.e., we are in a loop where no values are changing.
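
A rough sketch of what such a check could look like, pulled out into a standalone helper so it can be tried in isolation. advance_window() is a hypothetical name and not something in perf today; it only mirrors the offset update from the !event branch and reports when nothing would move.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper, not part of perf: performs the same offset update
 * as the !event branch and reports whether it made progress.
 *
 * Returns 0 if file_offset advanced, -1 if a remap would map the very
 * same window again (page_offset == 0).
 */
static int advance_window(uint64_t *file_offset, uint64_t *head,
			  uint64_t page_size)
{
	uint64_t page_offset;

	assert(page_size != 0);
	page_offset = page_size * (*head / page_size);
	if (page_offset == 0)
		return -1;	/* no value changes: bail out instead of goto remap */

	*file_offset += page_offset;
	*head -= page_offset;
	return 0;
}
```

With file_offset = 0 and head = 288 (page size assumed to be 4096) this returns -1 and leaves both values untouched, which is the no-progress case described above. A caller could treat that as an error and go to its error path rather than jumping back to remap.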

David