Re: [Question] How does perf still record the stack of a specified pid even when that process is interrupted and CPU is scheduled to other process

From: Yunlong Song
Date: Sat Apr 25 2015 - 10:04:23 EST


On 2015/4/24 21:49, Yunlong Song wrote:
> On 2015/4/24 21:31, Yunlong Song wrote:
>> [Profiling Background]
>> Now we are profiling the performance of ext4 and f2fs on an eMMC card with iozone,
>> we find a case that ext4 is better than f2fs in random write under the test of
>> "iozone -s 262144 -r 64 -i 0 -i 2". We want to analyze the I/O delay of the two
>> file systems. We have got a conclusion that 1% of sys_write takes up 60% time of
>> the overall sys_write (262144/64=4096). We want to find out the call stack during
>> this specific 1% sys_write. Our idea is to record the stack in a certain time period
>> and since the specific 1% case takes up 60% time, the total number of records of its
>> stack should also takes up 60% of the total records, then we can recognize those stacks
>> and figure out what the I/O stack of f2fs is doing in the 1% case.
>>
>> [Profiling Problem]
>>
>> Although perf can record the events (with call stack) of a specified pid, e.g. using
>> "perf record -g iozone -s 262144 -r 64 -i 0 -i 2". But we find iozone is interrupted
>> and the CPU is scheduled to other process. As a result, perf will not record any events
>> of iozone until iozone's context is recovered and the CPU is scheduled to continue
>> processing the sys_write of iozone.
>
>> This obeys our initial idea which is described in [Profiling Background],
>
> This "disobeys" our initial idea which is described in [Profiling Background],
>
> since we cannot recognize the call stacks of the specific 1% case
>> by using the ratio of the record number.
>>
>> [Alternative Solution without Perf]
>> We can use /proc/#pid/stack to record the stack in a certain period (e.g. 1ms) of iozone,
>> no matter whether iozone is interrupted or not. However, we have not taken a deep sight
>> into this, since we want to use perf to do this kind of thing.
>>
>> [Question about Perf]
>> So we have a question that "How does perf still record the stack of a specified pid even
>> when that process is interrupted and CPU is scheduled to other process?"
>>
>
>

Here is another example:

$ cat test.c
void func_loop(void)
{
int i, sum = 0;
for (i = 0; i < 100000; i++)
sum += 1;
}

void func_sleep(void)
{
sleep(1);
}

int main(void)
{
while (1) {
func_loop();
func_sleep();
}
return 0;
}

$ gcc test.c -o test -O0
$ ./test
$ ps aux | grep test
Yunlong+ 133798 0.0 0.0 4056 348 pts/7 S+ 21:52 0:00 ./test
$ perf record -g -p 133798 sleep 5
$ perf report -g --stdio
# To display the perf.data header info, please use --header/--header-only options.
#
# Samples: 16 of event 'cycles'
# Event count (approx.): 2439331
#
# Children Self Command Shared Object Symbol
# ........ ........ ....... ................. ...............................
#
79.46% 76.48% test test [.] func_loop
|
---func_loop
|
|--97.07%-- main
| __libc_start_main
--2.93%-- [...]

77.14% 0.00% test libc-2.19.so [.] __libc_start_main
|
---__libc_start_main

77.14% 0.00% test test [.] main
|
---main
__libc_start_main

20.54% 0.00% test [unknown] [.] 0000000000000000
|
---0

20.54% 0.00% test libc-2.19.so [.] __GI___libc_nanosleep
|
---__GI___libc_nanosleep
0

20.54% 0.00% test [unknown] [k] 0x00007f2fa07ef6b0
|
---0x7f2fa07ef6b0
__GI___libc_nanosleep
0

20.54% 0.00% test [kernel.kallsyms] [k] system_call_fastpath
|
---system_call_fastpath
0x7f2fa07ef6b0
__GI___libc_nanosleep

As shown above, the func_loop is regarded as the main operation and its call stack
is showed on the top of perf report. However, we know that func_sleep is really the
main operation from the view of time duration indeed. And if we want to its actual
call stack in the real time, we can see the result as following:

$ while true ; do cat /proc/133798/stack ; sleep 0.1 ; done
[<ffffffff8107aed3>] hrtimer_nanosleep+0xa3/0x160
[<ffffffff8107afed>] SyS_nanosleep+0x5d/0x80
[<ffffffff81519329>] system_call_fastpath+0x16/0x1b
[<00007f2fa07ef6b0>] 0x7f2fa07ef6b0
[<ffffffffffffffff>] 0xffffffffffffffff
[<ffffffff8107aed3>] hrtimer_nanosleep+0xa3/0x160
[<ffffffff8107afed>] SyS_nanosleep+0x5d/0x80
[<ffffffff81519329>] system_call_fastpath+0x16/0x1b
[<00007f2fa07ef6b0>] 0x7f2fa07ef6b0
[<ffffffffffffffff>] 0xffffffffffffffff
[<ffffffff8107aed3>] hrtimer_nanosleep+0xa3/0x160
[<ffffffff8107afed>] SyS_nanosleep+0x5d/0x80
[<ffffffff81519329>] system_call_fastpath+0x16/0x1b
[<00007f2fa07ef6b0>] 0x7f2fa07ef6b0
[<ffffffffffffffff>] 0xffffffffffffffff
[<ffffffff8107aed3>] hrtimer_nanosleep+0xa3/0x160
[<ffffffff8107afed>] SyS_nanosleep+0x5d/0x80
[<ffffffff81519329>] system_call_fastpath+0x16/0x1b
[<00007f2fa07ef6b0>] 0x7f2fa07ef6b0
[<ffffffffffffffff>] 0xffffffffffffffff
[<ffffffff8107aed3>] hrtimer_nanosleep+0xa3/0x160
[<ffffffff8107afed>] SyS_nanosleep+0x5d/0x80
[<ffffffff81519329>] system_call_fastpath+0x16/0x1b
[<00007f2fa07ef6b0>] 0x7f2fa07ef6b0
[<ffffffffffffffff>] 0xffffffffffffffff
[<ffffffff8107aed3>] hrtimer_nanosleep+0xa3/0x160
[<ffffffff8107afed>] SyS_nanosleep+0x5d/0x80
[<ffffffff81519329>] system_call_fastpath+0x16/0x1b
[<00007f2fa07ef6b0>] 0x7f2fa07ef6b0
[<ffffffffffffffff>] 0xffffffffffffffff
[<ffffffff8107aed3>] hrtimer_nanosleep+0xa3/0x160
[<ffffffff8107afed>] SyS_nanosleep+0x5d/0x80
[<ffffffff81519329>] system_call_fastpath+0x16/0x1b
[<00007f2fa07ef6b0>] 0x7f2fa07ef6b0
[<ffffffffffffffff>] 0xffffffffffffffff
[<ffffffff8107aed3>] hrtimer_nanosleep+0xa3/0x160
[<ffffffff8107afed>] SyS_nanosleep+0x5d/0x80
[<ffffffff81519329>] system_call_fastpath+0x16/0x1b
[<00007f2fa07ef6b0>] 0x7f2fa07ef6b0
[<ffffffffffffffff>] 0xffffffffffffffff

As shown above, func_sleep is regarded as the main operation from the view of time and
its call stack is printed in real time.

--
Thanks,
Yunlong Song

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/