Re: [RFC/PATCHSET 0/6] perf kmem: Implement page allocation analysis (v1)

From: Namhyung Kim
Date: Thu Mar 12 2015 - 10:59:46 EST


Hi Ingo,

On Thu, Mar 12, 2015 at 11:41:19AM +0100, Ingo Molnar wrote:
> * Namhyung Kim <namhyung@xxxxxxxxxx> wrote:
>
> > Hello,
> >
> > Currently the perf kmem command only analyzes SLAB memory allocation, and
> > I'd like to introduce page allocation analysis as well. Users can use
> > the --slab and/or --page options to select it. If neither option
> > is used, it does slab allocation analysis for backward compatibility.
> >
> > Patches 1-3 are bugfixes and cleanups. Patch 4 implements basic
> > support for page allocation analysis, patch 5 deals with the callsite
> > and finally patch 6 implements sorting.
> >
> > In this patchset, I used two kmem events: kmem:mm_page_alloc and
> > kmem:mm_page_free for analysis as they can track every memory
> > allocation/free path AFAIK. However, unlike slab tracepoint events,
> > those page allocation events don't provide callsite info directly. So
> > I recorded callchains and extracted callsites like below:
>
> Really cool features!

Thanks!


>
> I have a couple of output typography observations:
>
> > Normal page allocation callchains look like this:
> >
> > 360a7e __alloc_pages_nodemask
> > 3a711c alloc_pages_current
> > 357bc7 __page_cache_alloc <-- callsite
> > 357cf6 pagecache_get_page
> > 48b0a prepare_pages
> > 494d3 __btrfs_buffered_write
> > 49cdf btrfs_file_write_iter
> > 3ceb6e new_sync_write
> > 3cf447 vfs_write
> > 3cff99 sys_write
> > 7556e9 system_call
> > f880 __write_nocancel
> > 33eb9 cmd_record
> > 4b38e cmd_kmem
> > 7aa23 run_builtin
> > 27a9a main
> > 20800 __libc_start_main
> >
> > But the first two are internal page allocation functions, so they should be
> > skipped. To identify such allocation functions, I used the following regex:
> >
> > ^_?_?(alloc|get_free|get_zeroed)_pages?
> >
> > This gave me the following list of functions (you can see this with -v):
> >
> > alloc func: __get_free_pages
> > alloc func: get_zeroed_page
> > alloc func: alloc_pages_exact
> > alloc func: __alloc_pages_direct_compact
> > alloc func: __alloc_pages_nodemask
> > alloc func: alloc_page_interleave
> > alloc func: alloc_pages_current
> > alloc func: alloc_pages_vma
> > alloc func: alloc_page_buffers
> > alloc func: alloc_pages_exact_nid
> >
> > After skipping those functions, it got '__page_cache_alloc'.
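
As a rough illustration of the skipping step quoted above (not the actual
perf code), the idea can be sketched in Python. The regex is the one from
the patchset description; find_callsite() and the shortened sample chain
are made up for this example, and matching on symbol names directly is a
simplifying assumption:

```python
import re

# Regex quoted from the patchset description.
ALLOC_FUNC_RE = re.compile(r"^_?_?(alloc|get_free|get_zeroed)_pages?")


def find_callsite(callchain):
    """Return the first symbol that is not an internal page allocation function.

    `callchain` is a list of symbol names, innermost frame first.
    Returns None if every frame matches the allocator regex.
    (Hypothetical helper, for illustration only.)
    """
    for symbol in callchain:
        if not ALLOC_FUNC_RE.match(symbol):
            return symbol
    return None


# Top of the sample callchain shown above.
chain = [
    "__alloc_pages_nodemask",
    "alloc_pages_current",
    "__page_cache_alloc",
    "pagecache_get_page",
    "prepare_pages",
]
print(find_callsite(chain))
```

The first two frames match the regex and are skipped, so the sketch reports
__page_cache_alloc, the same callsite as in the quoted example.
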
> >
> > Other information such as allocation order, migration type and gfp
> > flags are provided by tracepoint events.
> >
> > Basically the output will be sorted by total allocation bytes, but you
> > can change it by using -s/--sort option. The following sort keys are
> > added to support page analysis: page, order, mtype, gfp. Existing
> > 'callsite', 'bytes' and 'hit' sort keys also can be used.
> >
> > An example follows:
> >
> > # perf kmem record --slab --page sleep 1
> > [ perf record: Woken up 0 times to write data ]
> > [ perf record: Captured and wrote 49.277 MB perf.data (191027 samples) ]
> >
> > # perf kmem stat --page --caller -l 10 -s order,hit
> >
> > --------------------------------------------------------------------------------------------
> >  Total_alloc/Per |   Hit | Order | Migrate type | GFP flag | Callsite
>
> s/Per/Size
> s/Hit/Hits
> s/Migrate type/Migration type
> s/GFP flag/GFP flags
>
> ?

OK, will change. (They'll take up a bit more column space, though.)


>
> > --------------------------------------------------------------------------------------------
> >      65536/16384 |     4 |     2 |  RECLAIMABLE | 00285250 | new_slab
> >    51347456/4096 | 12536 |     0 |      MOVABLE | 0102005a | __page_cache_alloc
> >       53248/4096 |    13 |     0 |    UNMOVABLE | 002084d0 | pte_alloc_one
> >       40960/4096 |    10 |     0 |      MOVABLE | 000280da | handle_mm_fault
> >       28672/4096 |     7 |     0 |    UNMOVABLE | 000000d0 | __pollwait
> >       20480/4096 |     5 |     0 |      MOVABLE | 000200da | do_wp_page
> >       20480/4096 |     5 |     0 |      MOVABLE | 000200da | do_cow_fault
> >       16384/4096 |     4 |     0 |    UNMOVABLE | 00000200 | __tlb_remove_page
> >       16384/4096 |     4 |     0 |    UNMOVABLE | 000084d0 | __pmd_alloc
> >        8192/4096 |     2 |     0 |    UNMOVABLE | 000084d0 | __pud_alloc
> >              ... |   ... |   ... |          ... |      ... | ...
> > --------------------------------------------------------------------------------------------
> >
> > SUMMARY (page allocator)
> > ========================
> > Total alloc requested: 12593
> > Total alloc failure : 0
> > Total bytes allocated: 51630080
> > Total free requested: 115
> > Total free unmatched: 67
> > Total bytes freed : 471040
>
> I'd suggest the following changes to the format:
>
> - Collapse stats into 3 groups: 'allocated+freed', 'allocated only',
> 'freed only', depending on how much of their lifetime we've
> managed to trace. These groups are really distinct and it makes
> little sense to mix up their stats.

Good idea. Actually, I'm thinking about a new option that shows only
live allocations (excluding freed pages) in the table. FYI, the
current number is the total allocated memory (including freed pages).


>
> - Add commas to the numbers, to make it easier to read and compare
> larger numbers.

OK

>
> - Right-align the numbers, to make them easy to compare when they
> are placed under each other.

OK

>
> - Merge the 'count' and 'bytes' stats into a single line, so that
> it's more compact, easier to navigate, but also only comparable
> type numbers are placed under each other.

OK

>
> I.e. something like this (mockup) output:
>
> SUMMARY (page allocator)
> ========================
>
> Pages allocated+freed:      12,593 [ 51,630,080 bytes ]
>
> Pages allocated-only:        2,342 [  1,235,010 bytes ]
> Pages freed-only:               67 [    135,311 bytes ]
>
> Page allocation failures :       0

Looks a lot better!

One thing I need to tell you is that the numbers are not pages but
requests.
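
For illustration only, the comma-separated, right-aligned layout could be
produced along these lines in Python. The format_summary() helper, the
column widths and the "Requests" labels are assumptions made for this
sketch; only the first row uses numbers from the summary above, the rest
are placeholders:

```python
def format_summary(rows, failures):
    """Render summary lines with comma-separated, right-aligned numbers.

    `rows` is a list of (label, request_count, nr_bytes) tuples.
    """
    width = max(len(label) for label, _, _ in rows) + 1  # room for ':'
    lines = ["SUMMARY (page allocator)", "========================", ""]
    for label, count, nr_bytes in rows:
        name = label + ":"
        lines.append(f"{name:<{width}} {count:>10,} [ {nr_bytes:>14,} bytes ]")
    lines.append("")
    name = "Allocation failures:"
    lines.append(f"{name:<{width}} {failures:>10,}")
    return "\n".join(lines)


rows = [
    ("Requests allocated+freed", 12593, 51630080),
    ("Requests allocated-only", 100, 409600),  # placeholder values
    ("Requests freed-only", 67, 274432),       # placeholder values
]
print(format_summary(rows, failures=0))
```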


>
>
> > Order     UNMOVABLE   RECLAIMABLE       MOVABLE      RESERVED   CMA/ISOLATE
> > -----  ------------  ------------  ------------  ------------  ------------
> >     0            32             0         12557             0             0
> >     1             0             0             0             0             0
> >     2             0             4             0             0             0
> >     3             0             0             0             0             0
> >     4             0             0             0             0             0
> >     5             0             0             0             0             0
> >     6             0             0             0             0             0
> >     7             0             0             0             0             0
> >     8             0             0             0             0             0
> >     9             0             0             0             0             0
> >    10             0             0             0             0             0
>
> Here I'd suggest the following refinements:
>
> - Use '.' instead of '0', to make actual nonzero values stand out
> visually, while still keeping a tabular format

OK

>
> - Merge the 'Reserved', 'CMA/Isolate' columns into a single 'Special'
> column: this will be zero in 99.9% of the cases, as those pages
> mostly deal with driver interfaces, mostly used during init/deinit.

I'm not sure about the CMA pages..

>
> - Capitalize less.

OK

>
> - Use comma-separated numbers for better readability.

OK

>
> So something like this:
>
>
> Order     Unmovable   Reclaimable       Movable       Special
> -----  ------------  ------------  ------------  ------------
>     0            32             .        12,557             .
>     1             .             .             .             .
>     2             .             4             .             .
>     3             .             .             .             .
>     4             .             .             .             .
>     5             .             .             .             .
>     6             .             .             .             .
>     7             .             .             .             .
>     8             .             .             .             .
>     9             .             .             .             .
>    10             .             .             .             .
>
>
> Look for example how easily noticeable the '4' value is now, while it
> was pretty easy to miss in the original table.

Indeed!
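
A small Python sketch of that table style: zero cells are printed as '.'
and nonzero cells get thousands separators. The format_order_table() name,
the dict-based input and the column widths are assumptions made for the
example, not what perf does internally:

```python
MIGRATE_TYPES = ["Unmovable", "Reclaimable", "Movable", "Special"]


def format_order_table(counts, max_order=10):
    """Render an order x migration type table.

    `counts` maps (order, migrate_type) to a request count; missing or
    zero entries are shown as '.' so that nonzero values stand out.
    """
    header = f"{'Order':>5}" + "".join(f"  {name:>12}" for name in MIGRATE_TYPES)
    rule = "-" * 5 + "".join("  " + "-" * 12 for _ in MIGRATE_TYPES)
    lines = [header, rule]
    for order in range(max_order + 1):
        row = f"{order:>5}"
        for name in MIGRATE_TYPES:
            value = counts.get((order, name), 0)
            text = f"{value:,}" if value else "."
            row += f"  {text:>12}"
        lines.append(row)
    return "\n".join(lines)


# Counts taken from the table quoted above.
counts = {
    (0, "Unmovable"): 32,
    (0, "Movable"): 12557,
    (2, "Reclaimable"): 4,
}
print(format_order_table(counts))
```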

>
> > I have some ideas on how to improve it, but I'd also like to hear other
> > ideas, suggestions, feedback and so on.
>
> So there's one thing that would be useful: to track pages allocated on
> one node, but freed on another. Those kinds of allocation/free
> patterns are especially expensive and might make sense to visualize.

I think it can be done easily as slab analysis already contains the info.
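
Just to sketch the idea in Python: the example below counts frees that
happen on a different node than the matching allocation. It assumes events
reduced to (kind, pfn, cpu) tuples and a cpu-to-node map, and it uses the
node of the CPU issuing the request as a stand-in for the page's node. All
of these, including the count_cross_node_frees() name, are hypothetical:

```python
def count_cross_node_frees(events, cpu_to_node):
    """Count pages allocated on one node but freed on another.

    `events` is an iterable of (kind, pfn, cpu) tuples in time order,
    where kind is "alloc" or "free". `cpu_to_node` maps a CPU number to
    its NUMA node. Frees without a traced allocation are ignored.
    """
    alloc_node = {}  # pfn -> node of the CPU that allocated it
    cross = 0
    for kind, pfn, cpu in events:
        node = cpu_to_node[cpu]
        if kind == "alloc":
            alloc_node[pfn] = node
        elif kind == "free" and pfn in alloc_node:
            if alloc_node.pop(pfn) != node:
                cross += 1
    return cross


cpu_to_node = {0: 0, 1: 0, 2: 1, 3: 1}
events = [
    ("alloc", 100, 0),
    ("alloc", 101, 1),
    ("free", 100, 2),   # allocated on node 0, freed on node 1
    ("free", 101, 0),   # same node
    ("free", 200, 3),   # allocation not traced, ignored
]
print(count_cross_node_frees(events, cpu_to_node))
```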

Thanks for your useful feedback!
Namhyung