Re: [tip:perfcounters/core] perf_counter: x86: Fix call-chain support to use NMI-safe methods

From: Ingo Molnar
Date: Mon Jun 15 2009 - 16:26:51 EST



* Ingo Molnar <mingo@xxxxxxx> wrote:

> Which gave these overall stats:
>
> Performance counter stats for './prctl 0 0':
>
> 28414.696319 task-clock-msecs # 0.997 CPUs
> 3 context-switches # 0.000 M/sec
> 1 CPU-migrations # 0.000 M/sec
> 149 page-faults # 0.000 M/sec
> 87254432334 cycles # 3070.750 M/sec
> 5078691161 instructions # 0.058 IPC
> 304144 cache-references # 0.011 M/sec
> 28760 cache-misses # 0.001 M/sec
>
> 28.501962853 seconds time elapsed.
>
> 87254432334/1000000000 ~== 87, so we have 87 cycles cost per
> iteration.
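
The variant measured there is the cr2+direct-access one: a plain
__copy_from_user_inatomic() with cr2 saved/restored around it, so that
a fault taken from NMI context cannot clobber the fault address an
interrupted #PF handler still needs. A rough sketch of the idea (the
helper name is made up, read_cr2()/write_cr2() are the usual x86
accessors):

static unsigned long
nmi_copy_user_cr2(void *to, const void __user *from, unsigned long n)
{
	unsigned long cr2 = read_cr2();	/* save a possibly live fault address */
	unsigned long left;

	pagefault_disable();		/* keep the #PF path from sleeping */
	left = __copy_from_user_inatomic(to, from, n);
	pagefault_enable();

	write_cr2(cr2);			/* restore it for the interrupted handler */

	return n - left;		/* bytes actually copied */
}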

I also measured the GUP-based copy_from_user_nmi(), on 64-bit (so
there's not even any real atomic-kmap/invlpg overhead):

Performance counter stats for './prctl 0 0':

55580.513882 task-clock-msecs # 0.997 CPUs
3 context-switches # 0.000 M/sec
1 CPU-migrations # 0.000 M/sec
149 page-faults # 0.000 M/sec
176375680192 cycles # 3173.337 M/sec
299353138289 instructions # 1.697 IPC
3388060 cache-references # 0.061 M/sec
1318977 cache-misses # 0.024 M/sec

55.748468367 seconds time elapsed.

This shows the overhead of looking up pagetables:
176375680192/1000000000 ~== 176 cycles per iteration. The cr2
save/restore based variant above is about twice as fast (87 cycles).
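
For comparison, the GUP based path has to do the full lookup for every
copy, roughly along these lines (a condensed sketch of
copy_from_user_nmi(); error handling and the kmap slot details are
trimmed):

static unsigned long
copy_from_user_nmi(void *to, const void __user *from, unsigned long n)
{
	unsigned long offset, addr = (unsigned long)from;
	unsigned long size, len = 0;
	struct page *page;
	void *map;

	do {
		/* lockless pagetable walk, takes a reference on the page */
		if (!__get_user_pages_fast(addr, 1, 0, &page))
			break;

		offset = addr & (PAGE_SIZE - 1);
		size   = min(PAGE_SIZE - offset, n - len);

		map = kmap_atomic(page, KM_NMI);	/* trivial on 64-bit */
		memcpy(to, map + offset, size);
		kunmap_atomic(map, KM_NMI);
		put_page(page);				/* drop the reference */

		len  += size;
		to   += size;
		addr += size;
	} while (len < n);

	return len;
}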

Here's the profile btw:

aldebaran:~> perf report -s s

#
# (1813480 samples)
#
# Overhead Symbol
# ........ ......
#
23.99% [k] __get_user_pages_fast
19.89% [k] gup_pte_range
18.98% [k] gup_pud_range
16.95% [k] copy_from_user_nmi
16.04% [k] put_page
3.17% [k] sys_prctl
0.02% [k] _spin_lock
0.02% [k] copy_user_generic_string
0.02% [k] get_page_from_freelist

taking a look at 'perf annotate __get_user_pages_fast' suggests
these two hot-spots:

0.04 : ffffffff810310cc: 9c pushfq
9.24 : ffffffff810310cd: 41 5d pop %r13
1.43 : ffffffff810310cf: fa cli
3.44 : ffffffff810310d0: 48 89 fb mov %rdi,%rbx
0.00 : ffffffff810310d3: 4d 8d 7e ff lea -0x1(%r14),%r15
0.00 : ffffffff810310d7: 48 c1 eb 24 shr $0x24,%rbx
0.00 : ffffffff810310db: 81 e3 f8 0f 00 00 and $0xff8,%ebx

~15% of its overhead is in the pushfq+cli sequence above, and 50% is here:

0.71 : ffffffff81031141: 41 55 push %r13
0.05 : ffffffff81031143: 9d popfq
30.07 : ffffffff81031144: 8b 55 d4 mov -0x2c(%rbp),%edx
2.78 : ffffffff81031147: 48 83 c4 20 add $0x20,%rsp
0.00 : ffffffff8103114b: 89 d0 mov %edx,%eax
10.93 : ffffffff8103114d: 5b pop %rbx
0.02 : ffffffff8103114e: 41 5c pop %r12
1.28 : ffffffff81031150: 41 5d pop %r13
0.51 : ffffffff81031152: 41 5e pop %r14

So either pushfq+cli...popfq sequences are a lot more expensive on
Nehalem than I imagined, or instruction skidding is tricking us here.
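
That pushfq+cli / popfq pair is the local_irq_save()/local_irq_restore()
bracket __get_user_pages_fast() puts around its lockless walk (IRQs off
is what keeps the pagetables from being freed under us), i.e. in C:

	unsigned long flags;

	local_irq_save(flags);		/* pushfq; pop %r13; cli */
	/* ... lockless pgd/pud/pmd/pte walk, taking page refs ... */
	local_irq_restore(flags);	/* push %r13; popfq */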

gup_pte_range has a clear hotspot with a locked instruction:

2.46 : ffffffff81030d88: 48 8d 41 08 lea 0x8(%rcx),%rax
0.00 : ffffffff81030d8c: f0 ff 41 08 lock incl 0x8(%rcx)
53.52 : ffffffff81030d90: 49 63 01 movslq (%r9),%rax
0.00 : ffffffff81030d93: 48 81 c6 00 10 00 00 add $0x1000,%rsi

11% of the total overhead - or about 19 cycles.
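
That locked instruction is the per-page refcount bump: gup_pte_range()
does a get_page() on every page it looks up (and the put_page() visible
in the profile drops the reference again), which minus the debug checks
is essentially:

static inline void get_page(struct page *page)
{
	atomic_inc(&page->_count);	/* the "lock incl" above */
}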

So it seems cr2+direct-access is distinctly faster than fast-gup.

And the fast-gup overhead is incurred _per frame entry_ - which makes
cr2+direct-access (where the cr2 save/restore is done once per NMI)
_far_ more performant, since a dozen or more call-chain entries are
the norm.
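
Back-of-the-envelope: with ~12 entries per chain that is roughly
12*176 ~== 2100 cycles of fast-gup overhead per NMI, versus
12*87 ~== 1050 cycles for the direct copy - and the cr2 save/restore
part of the latter only has to be done once per NMI, so the real gap
is a bit bigger still.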

Ingo