[PATCH v3 0/8] Perf: Make 'perf top -p $pid' perceive newly forked threads.

From: chenggang
Date: Wed Mar 13 2013 - 05:44:43 EST

From: chenggang <chenggang.qcg@xxxxxxxxxx>

This patch set is based on the 3.8-rc7 kernel.

This is version 3. In this version I optimized the performance and the structure.

This patch set makes 'perf top -p $pid' able to perceive new threads that are
forked by the target processes. 'perf top{record} -p $pid' can perceive the
threads that were forked before perf was executed, but it cannot perceive
threads that are forked after perf has started. This is an important defect in
perf, because many applications fork new threads on the fly.
For performance reasons, the event inherit mechanism is disabled when per-task
counters are used. Some internal data structures, such as thread_map,
evlist->mmap, evsel->fd, evsel->id and evsel->sample_id, are implemented as
arrays that are allocated in the initialization phase. Their size is fixed, so
they cannot easily be extended for newly forked threads.

So, we have done the following work:
1) Transformed thread_map into a linked list.
Implemented the interfaces to extend and shrink an existing thread_map.
2) Transformed xyarray into a linked list. Implemented the interfaces to extend
and shrink an existing xyarray.
The xyarray is a 2-dimensional structure.
The x-dimension is the cpus, and it is still an array.
The y-dimension is the threads of interest, and it is a linked list.
3) Implemented evlist->mmap, evsel->fd, evsel->id and evsel->sample_id with the new xyarray.
Implemented interfaces to expand and shrink these structures.
4) Added 2 callback functions to top->perf_tool. They are called when
PERF_RECORD_FORK and PERF_RECORD_EXIT events are received.
When a PERF_RECORD_FORK event is received, all related data structures are
expanded, and a new fd and mmap are opened.
When a PERF_RECORD_EXIT event is received, all nodes in the related data structures are
removed.
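
To illustrate item 1), here is a minimal, self-contained sketch of a thread map
kept as a doubly linked list that can be extended and shrunk at run time. The
names (struct tnode, struct tmap, tmap_init, tmap_append, tmap_remove) are
hypothetical and are not the interfaces added by this patch set. The sketch also
uses a hand-rolled list instead of the kernel's list.h so that it compiles on
its own.

```c
#include <stdlib.h>
#include <sys/types.h>

/* Hypothetical types for illustration; not the perf thread_map API. */
struct tnode {
	pid_t tid;
	struct tnode *prev, *next;
};

struct tmap {
	struct tnode head;	/* sentinel node: head.next is the first thread */
	int nr;			/* number of threads currently in the map */
};

void tmap_init(struct tmap *m)
{
	m->head.prev = m->head.next = &m->head;
	m->nr = 0;
}

/* Extend: append a node for a newly forked thread. Returns 0, or -1 on OOM. */
int tmap_append(struct tmap *m, pid_t tid)
{
	struct tnode *n = malloc(sizeof(*n));

	if (!n)
		return -1;
	n->tid = tid;
	n->prev = m->head.prev;
	n->next = &m->head;
	m->head.prev->next = n;
	m->head.prev = n;
	m->nr++;
	return 0;
}

/* Shrink: unlink and free the node of an exited thread. Returns -1 if absent. */
int tmap_remove(struct tmap *m, pid_t tid)
{
	struct tnode *n;

	for (n = m->head.next; n != &m->head; n = n->next) {
		if (n->tid == tid) {
			n->prev->next = n->next;
			n->next->prev = n->prev;
			free(n);
			m->nr--;
			return 0;
		}
	}
	return -1;
}
```

Unlike a fixed-size array, neither operation needs to reallocate or copy the
entries of the threads that are already being monitored.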

A linked list is flexible: list_add and list_del are easy to use. In addition,
the performance penalty (especially in CPU utilization) is low.
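
The xyarray layout from item 2) can be sketched in the same spirit: every
thread (y) is one list node that carries a plain array with one slot per cpu
(x). This is only a sketch under assumptions. The names (struct xy_row,
struct xy, xy_init, xy_add_row, xy_del_row, xy_entry) and the choice of an int
slot initialized to -1, as one would for an unopened fd, are hypothetical and do
not come from the patches.

```c
#include <stdlib.h>

/* Hypothetical layout: one list node per thread, one array slot per cpu. */
struct xy_row {
	struct xy_row *next;
	int slot[];		/* ncpus entries, e.g. one fd per cpu */
};

struct xy {
	int ncpus;		/* x-dimension: fixed when the array is set up */
	int nrows;		/* y-dimension: grows and shrinks with threads */
	struct xy_row *rows;
};

void xy_init(struct xy *a, int ncpus)
{
	a->ncpus = ncpus;
	a->nrows = 0;
	a->rows = NULL;
}

/* Extend the y-dimension by one thread; every slot starts at -1. */
struct xy_row *xy_add_row(struct xy *a)
{
	struct xy_row **pp = &a->rows;
	struct xy_row *r = malloc(sizeof(*r) + a->ncpus * sizeof(int));
	int i;

	if (!r)
		return NULL;
	for (i = 0; i < a->ncpus; i++)
		r->slot[i] = -1;
	r->next = NULL;
	while (*pp)			/* walk to the tail and link the row */
		pp = &(*pp)->next;
	*pp = r;
	a->nrows++;
	return r;
}

/* Shrink the y-dimension: drop row y. Returns -1 if y is out of range. */
int xy_del_row(struct xy *a, int y)
{
	struct xy_row **pp = &a->rows;
	struct xy_row *r;

	if (y < 0)
		return -1;
	while (*pp && y-- > 0)
		pp = &(*pp)->next;
	if (!*pp)
		return -1;
	r = *pp;
	*pp = r->next;
	free(r);
	a->nrows--;
	return 0;
}

/* Address of the slot for (cpu x, thread y), or NULL if out of range. */
int *xy_entry(struct xy *a, int x, int y)
{
	struct xy_row *r = a->rows;

	if (x < 0 || x >= a->ncpus || y < 0)
		return NULL;
	while (r && y-- > 0)
		r = r->next;
	return r ? &r->slot[x] : NULL;
}
```

Lookup by thread index becomes a list walk in this sketch, while access along
the cpu dimension stays a direct array index.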

At the end of this cover letter, I have attached a test program and its
Makefile. When it is executed, it prints its pid. Then, use this command:
'perf top -p *pid*'
perf top will then perceive the functions that are called by the threads forked
on the fly. The 'top' tool can be used to monitor the overhead of 'perf'. The
result shows that the cpu overhead of this patch set is less than 3%. I think
this overhead is acceptable.

My test environment is as follows:
# ========
# captured on: Wed Mar 13 15:23:55 2013
# perf version : 3.8.rc7.ga39f52
# arch : x86_64
# nrcpus online : 2
# nrcpus avail : 2
# cpudesc : Intel(R) Core(TM)2 Duo CPU P8700 @ 2.53GHz
# cpuid : GenuineIntel,6,23,10
# total memory : 3034932 kB

This function is already implemented for 'perf top -p $pid' in patch [8/8] of
this patch set. As a next step, 'perf record -p $pid' should be modified in the
same way.
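
As a rough illustration of the callback idea in item 4), the sketch below
dispatches fork and exit events to two callbacks that grow or shrink per-thread
state. Everything here is a stand-in: the enum values, struct ev, struct tool
and the function names are hypothetical, the pid filter is an assumption, and
the callbacks only adjust a counter where the real code would add or delete
list nodes and open or close the fd and mmap.

```c
#include <stddef.h>

/* Stand-ins for PERF_RECORD_FORK / PERF_RECORD_EXIT and their payload. */
enum ev_type { EV_FORK, EV_EXIT, EV_OTHER };

struct ev {
	enum ev_type type;
	int pid;		/* process the thread belongs to */
	int tid;		/* thread that was forked or has exited */
};

struct tool {
	int target_pid;		/* the pid given with -p */
	int nr_threads;		/* stand-in for the per-thread structures */
	int (*fork_cb)(struct tool *t, const struct ev *e);
	int (*exit_cb)(struct tool *t, const struct ev *e);
};

/* Where real code would extend the lists and open a new fd and mmap. */
int grow_for_thread(struct tool *t, const struct ev *e)
{
	(void)e;
	t->nr_threads++;
	return 0;
}

/* Where real code would delete the nodes that belong to the exited thread. */
int shrink_for_thread(struct tool *t, const struct ev *e)
{
	(void)e;
	t->nr_threads--;
	return 0;
}

/* Route one event to the matching callback; ignore other processes. */
int dispatch(struct tool *t, const struct ev *e)
{
	if (e->pid != t->target_pid)
		return 0;

	switch (e->type) {
	case EV_FORK:
		return t->fork_cb ? t->fork_cb(t, e) : 0;
	case EV_EXIT:
		return t->exit_cb ? t->exit_cb(t, e) : 0;
	default:
		return 0;
	}
}
```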

Thanks to David Ahern for his suggestion.

Cc: David Ahern <dsahern@xxxxxxxxx>
Cc: Peter Zijlstra <a.p.zijlstra@xxxxxxxxx>
Cc: Paul Mackerras <paulus@xxxxxxxxx>
Cc: Ingo Molnar <mingo@xxxxxxxxxx>
Cc: Arnaldo Carvalho de Melo <acme@xxxxxxxxxxxxxxxxxx>
Cc: Arjan van de Ven <arjan@xxxxxxxxxxxxxxx>
Cc: Namhyung Kim <namhyung@xxxxxxxxx>
Cc: Yanmin Zhang <yanmin.zhang@xxxxxxxxx>
Cc: Wu Fengguang <fengguang.wu@xxxxxxxxx>
Cc: Mike Galbraith <efault@xxxxxx>
Cc: Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx>
Signed-off-by: Chenggang Qin <chenggang.qcg@xxxxxxxxxx>

chenggang (8):
changed thread_map to list
changed xyarray to list
changed mmap to xyarray
changed evsel->id to xyarray
extend mechanism for evsel->id & evsel->fd
add some operations for mmap
changed the method to traverse mmap list
fork & exit event perceived

tools/perf/Makefile | 3 +-
tools/perf/builtin-record.c | 8 +-
tools/perf/builtin-stat.c | 2 +-
tools/perf/builtin-top.c | 116 ++++++++++++-
tools/perf/tests/mmap-basic.c | 4 +-
tools/perf/tests/open-syscall-tp-fields.c | 9 +-
tools/perf/tests/perf-record.c | 7 +-
tools/perf/util/event.c | 12 +-
tools/perf/util/evlist.c | 206 +++++++++++++++++++---
tools/perf/util/evlist.h | 14 +-
tools/perf/util/evsel.c | 118 +++++++++++--
tools/perf/util/evsel.h | 13 +-
tools/perf/util/header.c | 28 +--
tools/perf/util/header.h | 3 +-
tools/perf/util/python.c | 6 +-
tools/perf/util/thread_map.c | 265 +++++++++++++++++++++--------
tools/perf/util/thread_map.h | 16 +-
tools/perf/util/xyarray.c | 125 +++++++++++++-
tools/perf/util/xyarray.h | 68 +++++++-
19 files changed, 866 insertions(+), 157 deletions(-)

Here is a program to test the patch set.

#include <time.h>
#include <stdio.h>
#include <limits.h>
#include <pthread.h>
#include <math.h>
#include <sys/types.h>
#include <linux/unistd.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <fcntl.h>

#define CHILDREN_NUM 15000

/* Return a random value between min and max, read from /dev/urandom. */
unsigned int new_rand(unsigned int min, unsigned int max)
{
	int fd;
	unsigned int n = 0;

	fd = open("/dev/urandom", O_RDONLY);

	if (fd >= 0) {
		/* On a short or failed read, fall back to n = 0. */
		if (read(fd, &n, sizeof(n)) != sizeof(n))
			n = 0;
		close(fd);
	}

	return (unsigned int)((double)n / UINT_MAX * (max - min) + min);
}

pid_t gettid(void)
{
	return syscall(SYS_gettid);
}

/* Monotonic clock in nanoseconds. */
static inline unsigned long long rdclock(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return ts.tv_sec * 1000000000ULL + ts.tv_nsec;
}

/* Thread body: keep approximating pi until a random run time has elapsed. */
void *do_pi(void *arg)
{
	int p = (int)(long)arg;
	double mypi, h, sum, x;
	long long n, i;

	double cost_time;
	unsigned int exec_time;
	unsigned long long start, end;

	printf("new thread[%d]: %d tid: %d ppid: %d\n", getpid(), p, gettid(), getppid());

	exec_time = new_rand(50, 10000000);
	start = rdclock();

	while (1) {
		n = 5000;
		h = 1.0 / n;
		sum = 0.0;

		for (i = 1; i <= n; i += 1) {
			x = h * (i - 0.5);
			sum += 4.0 / (1.0 + pow(x, 2));
		}

		mypi = h * sum;
		(void)mypi;

		end = rdclock();

		cost_time = (double)(end - start) / 1e3;
		if (cost_time > (double)exec_time)	/* microseconds */
			break;
	}

	return NULL;
}

int main(void)
{
	int j, ret;

	pthread_t id[CHILDREN_NUM];

	printf("pid: %d\n", getpid());

	for (j = 0; j < CHILDREN_NUM; j++) {
		ret = pthread_create(id + j, NULL, do_pi, (void *)(long)j);
		if (ret) {
			printf("Create pthread error!\n");
			return 1;
		}
		usleep(new_rand(500, 1000));
	}

	for (j = 0; j < CHILDREN_NUM; j++)
		pthread_join(id[j], NULL);

	return 0;
}

If the program above is saved as "thread.c", the following Makefile builds it.

EXEC = thread

OBJS = thread.o

CC = gcc

INC = -I. -I/usr/include

CFLAGS = ${INC} -g
LDFLAGS = -L/usr/lib/x86_64-linux-gnu
LDLIBS = -lpthread -lm -ldl -lrt

${EXEC} : ${OBJS}
	${CC} -o $@ ${OBJS} ${LDFLAGS} ${LDLIBS}

.PHONY : clean
clean :
	rm -f ${OBJS} ${EXEC}
