Re: [RFC/PATCHSET 0/6] perf kmem: Implement page allocation analysis (v1)

From: Ingo Molnar
Date: Thu Mar 12 2015 - 06:41:31 EST


* Namhyung Kim <namhyung@xxxxxxxxxx> wrote:

> Hello,
>
> Currently the perf kmem command only analyzes SLAB memory allocation,
> and I'd like to introduce page allocation analysis as well. Users can
> use the --slab and/or --page options to select which one to run. If
> neither option is given, it does slab allocation analysis for backward
> compatibility.
>
> Patches 1-3 are a bugfix and cleanups. Patch 4 implements basic
> support for page allocation analysis, patch 5 deals with the callsite
> and finally patch 6 implements sorting.
>
> In this patchset, I used two kmem events, kmem:mm_page_alloc and
> kmem:mm_page_free, for analysis, as they can track every memory
> allocation/free path AFAIK. However, unlike slab tracepoint events,
> those page allocation events don't provide callsite info directly. So
> I recorded callchains and extracted callsites like below:

Really cool features!

I have a couple of output typography observations:

> Normal page allocation callchains look like this:
>
> 360a7e __alloc_pages_nodemask
> 3a711c alloc_pages_current
> 357bc7 __page_cache_alloc <-- callsite
> 357cf6 pagecache_get_page
> 48b0a prepare_pages
> 494d3 __btrfs_buffered_write
> 49cdf btrfs_file_write_iter
> 3ceb6e new_sync_write
> 3cf447 vfs_write
> 3cff99 sys_write
> 7556e9 system_call
> f880 __write_nocancel
> 33eb9 cmd_record
> 4b38e cmd_kmem
> 7aa23 run_builtin
> 27a9a main
> 20800 __libc_start_main
>
> But the first two are internal page allocation functions, so they
> should be skipped. To identify such allocation functions, I used the
> following regex:
>
> ^_?_?(alloc|get_free|get_zeroed)_pages?
>
> This gave me the following list of functions (you can see this with -v):
>
> alloc func: __get_free_pages
> alloc func: get_zeroed_page
> alloc func: alloc_pages_exact
> alloc func: __alloc_pages_direct_compact
> alloc func: __alloc_pages_nodemask
> alloc func: alloc_page_interleave
> alloc func: alloc_pages_current
> alloc func: alloc_pages_vma
> alloc func: alloc_page_buffers
> alloc func: alloc_pages_exact_nid
>
> After skipping those functions, it got '__page_cache_alloc'.
>
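
A minimal sketch of the callsite selection described above, written in
Python rather than perf's C and matching on symbol names directly (an
assumed simplification, not necessarily how the patchset does it). The
regex and the callchain are the ones quoted in this mail; the helper name
find_callsite is hypothetical.

```python
import re

# Regex quoted above for recognizing page allocator entry points.
ALLOC_FUNC_RE = re.compile(r"^_?_?(alloc|get_free|get_zeroed)_pages?")


def find_callsite(callchain):
    """Return the first symbol that is not a page allocation function.

    `callchain` lists symbol names from the innermost frame outwards,
    as in the example callchain. Returns None if every frame matches.
    """
    for sym in callchain:
        if not ALLOC_FUNC_RE.match(sym):
            return sym
    return None


# First few frames of the example callchain (hypothetical input format).
chain = [
    "__alloc_pages_nodemask",
    "alloc_pages_current",
    "__page_cache_alloc",
    "pagecache_get_page",
    "prepare_pages",
]
print(find_callsite(chain))
```

With the frames above, the two allocator functions are skipped and
'__page_cache_alloc' is reported as the callsite.
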
> Other information such as allocation order, migration type and gfp
> flags is provided by the tracepoint events.
>
> By default the output is sorted by total allocation bytes, but you
> can change that with the -s/--sort option. The following sort keys are
> added to support page analysis: page, order, mtype, gfp. The existing
> 'callsite', 'bytes' and 'hit' sort keys can also be used.
>
> An example follows:
>
> # perf kmem record --slab --page sleep 1
> [ perf record: Woken up 0 times to write data ]
> [ perf record: Captured and wrote 49.277 MB perf.data (191027 samples) ]
>
> # perf kmem stat --page --caller -l 10 -s order,hit
>
> --------------------------------------------------------------------------------------------
> Total_alloc/Per | Hit | Order | Migrate type | GFP flag | Callsite

s/Per/Size
s/Hit/Hits
s/Migrate type/Migration type
s/GFP flag/GFP flags

?

> --------------------------------------------------------------------------------------------
> 65536/16384 | 4 | 2 | RECLAIMABLE | 00285250 | new_slab
> 51347456/4096 | 12536 | 0 | MOVABLE | 0102005a | __page_cache_alloc
> 53248/4096 | 13 | 0 | UNMOVABLE | 002084d0 | pte_alloc_one
> 40960/4096 | 10 | 0 | MOVABLE | 000280da | handle_mm_fault
> 28672/4096 | 7 | 0 | UNMOVABLE | 000000d0 | __pollwait
> 20480/4096 | 5 | 0 | MOVABLE | 000200da | do_wp_page
> 20480/4096 | 5 | 0 | MOVABLE | 000200da | do_cow_fault
> 16384/4096 | 4 | 0 | UNMOVABLE | 00000200 | __tlb_remove_page
> 16384/4096 | 4 | 0 | UNMOVABLE | 000084d0 | __pmd_alloc
> 8192/4096 | 2 | 0 | UNMOVABLE | 000084d0 | __pud_alloc
> ... | ... | ... | ... | ... | ...
> --------------------------------------------------------------------------------------------
>
> SUMMARY (page allocator)
> ========================
> Total alloc requested: 12593
> Total alloc failure : 0
> Total bytes allocated: 51630080
> Total free requested: 115
> Total free unmatched: 67
> Total bytes freed : 471040

I'd suggest the following changes to the format:

- Collapse stats into 3 groups: 'allocated+freed', 'allocated only',
'freed only', depending on how much of their lifetime we've
managed to trace. These groups are really distinct and it makes
little sense to mix up their stats.

- Add commas to the numbers, to make it easier to read and compare
larger numbers.

- Right-align the numbers, to make them easy to compare when they
are placed under each other.

- Merge the 'count' and 'bytes' stats into a single line, so that
the output is more compact and easier to navigate, and so that only
numbers of comparable type are placed under each other.

I.e. something like this (mockup) output:

SUMMARY (page allocator)
========================

Pages allocated+freed:        12,593   [   51,630,080 bytes ]

Pages allocated-only:          2,342   [    1,235,010 bytes ]
Pages freed-only:                 67   [      135,311 bytes ]

Page allocation failures:          0
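
The grouping and number formatting suggested above could be sketched
roughly as follows. This is Python, not perf code: the Event record and
the helper names are hypothetical, and it assumes each event carries a
page frame number and a byte count.

```python
from collections import namedtuple

# Hypothetical event record: kind is "alloc" or "free", pfn identifies the page.
Event = namedtuple("Event", "kind pfn nbytes")


def summarize(events):
    """Split traced pages into allocated+freed, allocated-only and freed-only."""
    live = {}  # pfn -> bytes of allocations not yet freed
    stats = {
        "allocated+freed": [0, 0],  # group -> [count, bytes]
        "allocated-only": [0, 0],
        "freed-only": [0, 0],
    }
    for ev in events:
        if ev.kind == "alloc":
            live[ev.pfn] = ev.nbytes
        elif ev.pfn in live:  # free that matches a traced allocation
            nbytes = live.pop(ev.pfn)
            stats["allocated+freed"][0] += 1
            stats["allocated+freed"][1] += nbytes
        else:  # free of a page allocated before tracing started
            stats["freed-only"][0] += 1
            stats["freed-only"][1] += ev.nbytes
    for nbytes in live.values():  # still allocated when tracing stopped
        stats["allocated-only"][0] += 1
        stats["allocated-only"][1] += nbytes
    return stats


def format_summary(stats):
    """Render one line per group with right-aligned, comma-separated numbers."""
    lines = []
    for group, (count, nbytes) in stats.items():
        label = f"Pages {group}:"
        lines.append(f"{label:<24}{count:>10,} [ {nbytes:>12,} bytes ]")
    return lines


events = [
    Event("alloc", 1, 4096),
    Event("alloc", 2, 8192),
    Event("free", 1, 4096),
    Event("free", 9, 4096),
]
for line in format_summary(summarize(events)):
    print(line)
```

Keeping the count and the bytes on one line per group means only numbers
of the same kind end up under each other.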


> Order    UNMOVABLE  RECLAIMABLE      MOVABLE     RESERVED  CMA/ISOLATE
> ----- ------------ ------------ ------------ ------------ ------------
>     0           32            0        12557            0            0
>     1            0            0            0            0            0
>     2            0            4            0            0            0
>     3            0            0            0            0            0
>     4            0            0            0            0            0
>     5            0            0            0            0            0
>     6            0            0            0            0            0
>     7            0            0            0            0            0
>     8            0            0            0            0            0
>     9            0            0            0            0            0
>    10            0            0            0            0            0

Here I'd suggest the following refinements:

- Use '.' instead of '0', to make actual nonzero values stand out
visually, while still keeping a tabular format

- Merge the 'Reserved' and 'CMA/Isolate' columns into a single 'Special'
column: this will be zero in 99.9% of the cases, as those pages
mostly deal with driver interfaces and are mostly used during init/deinit.

- Capitalize less.

- Use comma-separated numbers for better readability.

So something like this:


Order    Unmovable  Reclaimable      Movable      Special
----- ------------ ------------ ------------ ------------
    0           32            .       12,557            .
    1            .            .            .            .
    2            .            4            .            .
    3            .            .            .            .
    4            .            .            .            .
    5            .            .            .            .
    6            .            .            .            .
    7            .            .            .            .
    8            .            .            .            .
    9            .            .            .            .
   10            .            .            .            .


Note, for example, how easily noticeable the '4' value is now, whereas
it was pretty easy to miss in the original table.
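
A rough Python sketch of these table refinements, assuming the per-order
counts are already available as plain lists. The helper names are
hypothetical and this is not taken from the patchset.

```python
def merge_special(row):
    """Fold the last two columns (Reserved, CMA/Isolate) into one 'Special' count."""
    return list(row[:3]) + [sum(row[3:])]


def format_cell(value, width=12):
    """Right-align a count, printing '.' for zero and using thousands separators."""
    text = "." if value == 0 else f"{value:,}"
    return text.rjust(width)


def format_order_table(counts, columns):
    """`counts` maps order -> list of per-column counts, in `columns` order."""
    lines = [
        "Order " + " ".join(c.rjust(12) for c in columns),
        "----- " + " ".join("-" * 12 for _ in columns),
    ]
    for order in sorted(counts):
        cells = " ".join(format_cell(v) for v in counts[order])
        lines.append(f"{order:>5} {cells}")
    return lines


# Counts from the table quoted above, for the first three orders only.
raw = {0: [32, 0, 12557, 0, 0], 1: [0, 0, 0, 0, 0], 2: [0, 4, 0, 0, 0]}
merged = {order: merge_special(row) for order, row in raw.items()}
columns = ["Unmovable", "Reclaimable", "Movable", "Special"]
for line in format_order_table(merged, columns):
    print(line)
```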

> I have some ideas on how to improve it. But I'd also like to hear
> other ideas, suggestions, feedback and so on.

So there's one thing that would be useful: to track pages allocated on
one node, but freed on another. Those kinds of allocation/free
patterns are especially expensive and it might make sense to visualize
them.
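
As a rough illustration of what such tracking could look like, here is a
Python sketch that pairs each free with its allocation and counts the
cases where the two happened on different nodes. It assumes a NUMA node
id can be attached to every event, which the quoted output does not show;
the function name and the event tuple layout are hypothetical.

```python
from collections import Counter


def cross_node_frees(events):
    """Count (alloc_node, free_node) pairs for pages freed on another node.

    `events` is an iterable of (kind, pfn, node) tuples, where kind is
    "alloc" or "free" and node is an assumed NUMA node id for the event.
    """
    alloc_node = {}  # pfn -> node of the traced allocation
    pairs = Counter()
    for kind, pfn, node in events:
        if kind == "alloc":
            alloc_node[pfn] = node
        elif pfn in alloc_node:  # ignore frees without a traced allocation
            src = alloc_node.pop(pfn)
            if src != node:
                pairs[(src, node)] += 1
    return pairs


events = [
    ("alloc", 1, 0),
    ("alloc", 2, 0),
    ("free", 1, 1),  # allocated on node 0, freed on node 1
    ("free", 2, 0),  # same node, not counted
    ("free", 3, 1),  # no traced allocation, not counted
]
for (src, dst), n in cross_node_frees(events).items():
    print(f"node {src} -> node {dst}: {n:,} pages")
```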

Thanks,

Ingo