Re: [PATCHSET 00/17] perf report: Add -F option for specifying output fields (v4)

From: Namhyung Kim
Date: Mon Apr 28 2014 - 21:13:43 EST


Hi Don,

On Mon, 28 Apr 2014 15:46:42 -0400, Don Zickus wrote:
> On Thu, Apr 24, 2014 at 05:00:15PM -0400, Don Zickus wrote:
>> On Thu, Apr 24, 2014 at 10:41:39PM +0900, Namhyung Kim wrote:
>> > Hi Don,
>> >
>> > 2014-04-23 (Wed), 08:58 -0400, Don Zickus:
>> > > On Wed, Apr 23, 2014 at 03:15:35PM +0900, Namhyung Kim wrote:
>> > > > On Tue, 22 Apr 2014 17:16:47 -0400, Don Zickus wrote:
>> > > > > ./perf mem record -a grep -r foo /* > /dev/null
>> > > > > ./perf mem report -F overhead,symbol_daddr,pid -s symbol_daddr,pid --stdio
>> > > > >
>> > > > > I was thinking I could sort everything based on the symbol_daddr and pid.
>> > > > > Then re-sort the output to display the highest 'symbol_daddr,pid' pair.
>> > > > > But it didn't seem to work that way. Instead it seems like I get the
>> > > > > original sort just displayed in the -F format.
>> > > >
>> > > > Could you please show me the output of your example?
>> > >
>> > >
>> > > # To display the perf.data header info, please use --header/--header-only
>> > > # options.
>> > > #
>> > > # Samples: 96K of event 'cpu/mem-loads/pp'
>> > > # Total weight : 1102938
>> > > # Sort order : symbol_daddr,pid
>> > > #
>> > > # Overhead Data Symbol Command: Pid
>> > > # ........ ......................................................................
>> > > #
>> > > 0.00% [k] 0xffff8807a8c1cf80 grep:116437
>> > > 0.00% [k] 0xffff8807a8c8cee0 grep:116437
>> > > 0.00% [k] 0xffff8807a8dceea0 grep:116437
>> > > 0.01% [k] 0xffff8807a9298dc0 grep:116437
>> > > 0.01% [k] 0xffff8807a934be40 grep:116437
>> > > 0.00% [k] 0xffff8807a9416ec0 grep:116437
>> > > 0.02% [k] 0xffff8807a9735700 grep:116437
>> > > 0.00% [k] 0xffff8807a98e9460 grep:116437
>> > > 0.02% [k] 0xffff8807a9afc890 grep:116437
>> > > 0.00% [k] 0xffff8807aa64feb0 grep:116437
>> > > 0.02% [k] 0xffff8807aa6b0030 grep:116437
>> >
>> > Hmm.. it seems that it's sorted exactly by the data symbol addresses, so
>> > I don't see any problem here. What did you expect? If you want to see
>> > those symbol_daddr,pid pairs sorted by overhead, you can use only one of
>> > the -F or -s options.
>>
>> Good question. I guess I was hoping to see things sorted by overhead, but
>> as you said, removing all the -F options gives me that. I have been
>> distracted with other fires this week and lost focus on what I was trying
>> to accomplish.
>>
>> Let me figure that out again and try to come up with a clearer email
>> explaining what I was looking for (for myself at least :-) ).
>
> Ok. I think I figured out what I need. This might be quite long..

Great. :)

>
>
> Our original concept for the c2c tool was to sort hist entries into
> cachelines, filter in only the HITMs and stores, and re-sort based on the
> cachelines with the most weight.
>
> So, using today's perf with a new sort key called 'cacheline' to achieve
> this (copy-n-pasted):

Maybe 'dcacheline' is a more appropriate name IMHO.

>
> ----
> #define CACHE_LINESIZE 64
> #define CLINE_OFFSET_MSK (CACHE_LINESIZE - 1)
> #define CLADRS(a) ((a) & ~(CLINE_OFFSET_MSK))
> #define CLOFFSET(a) (int)((a) & (CLINE_OFFSET_MSK))
>
> static int64_t
> sort__cacheline_cmp(struct hist_entry *left, struct hist_entry *right)
> {
>         u64 l, r;
>         struct map *l_map, *r_map;
>
>         if (!left->mem_info)  return -1;
>         if (!right->mem_info) return 1;
>
>         /* group event types together */
>         if (left->cpumode > right->cpumode) return -1;
>         if (left->cpumode < right->cpumode) return 1;
>
>         l_map = left->mem_info->daddr.map;
>         r_map = right->mem_info->daddr.map;
>
>         /* properly sort NULL maps to help combine them */
>         if (!l_map && !r_map)
>                 goto addr;
>
>         if (!l_map) return -1;
>         if (!r_map) return 1;
>
>         if (l_map->maj > r_map->maj) return -1;
>         if (l_map->maj < r_map->maj) return 1;
>
>         if (l_map->min > r_map->min) return -1;
>         if (l_map->min < r_map->min) return 1;
>
>         if (l_map->ino > r_map->ino) return -1;
>         if (l_map->ino < r_map->ino) return 1;
>
>         if (l_map->ino_generation > r_map->ino_generation) return -1;
>         if (l_map->ino_generation < r_map->ino_generation) return 1;
>
>         /*
>          * Addresses with no major/minor numbers are assumed to be
>          * anonymous in userspace.  Sort those on pid then address.
>          *
>          * The kernel and non-zero major/minor mapped areas are
>          * assumed to be unity mapped.  Sort those on address.
>          */
>
>         if ((left->cpumode != PERF_RECORD_MISC_KERNEL) &&
>             !l_map->maj && !l_map->min && !l_map->ino &&
>             !l_map->ino_generation) {
>                 /* userspace anonymous */
>
>                 if (left->thread->pid_ > right->thread->pid_) return -1;
>                 if (left->thread->pid_ < right->thread->pid_) return 1;

Isn't it necessary to check whether the address is in the same map in the
case of anon pages?  I mean, the daddr.al_addr is a map-relative offset, so
it might have the same value for different maps.
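
Something like this inside the anon branch, maybe?  (An untested sketch,
assuming comparing map->start is enough to tell different anon maps apart:)

                /* untested: separate different anon maps of the same pid */
                if (l_map->start > r_map->start) return -1;
                if (l_map->start < r_map->start) return 1;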


>         }
>
> addr:
>         /* al_addr does all the right addr - start + offset calculations */
>         l = CLADRS(left->mem_info->daddr.al_addr);
>         r = CLADRS(right->mem_info->daddr.al_addr);
>
>         if (l > r) return -1;
>         if (l < r) return 1;
>
>         return 0;
> }
>
> ----
>
> I can get the following 'perf mem report' outputs
>
> I used a special program called hitm_test3 which purposely generates
> HITMs either locally or remotely based on cpu input. It does this by
> having processA grab lockX from cacheline1 and release lockY from
> cacheline2, then having processB grab lockY from cacheline2 and release
> lockX from cacheline1 (IOW, ping-ponging two locks across two cachelines).
> It can be found here:
>
> http://people.redhat.com/dzickus/hitm_test/
>
> [ perf mem record -a hitm_test -s1,19 -c1000000 -t]
>
> (where -s is the cpus to bind to, -c is loop count, -t disables internal
> perf tracking)
>
> (using 'perf mem' to auto generate correct record/report options for
> cachelines)
> (the hitm counts should be higher, but sampling is a crapshoot. Using
> ld_lat=30 would probably filter most of the L1 hits)
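
(Just to make sure I follow the access pattern you describe, I imagine
hitm_test3 does roughly the following.  This is a minimal, hypothetical
sketch, not your actual program: two 'lock' words, each padded to its own
cacheline, are handed back and forth between two pinned processes, so each
CPU keeps loading a line the other CPU has just modified, which is what
shows up as HITMs.  Your real test apparently uses a SYSV shm segment,
judging from the output below; here I just use a shared anonymous mapping,
and the cpu numbers and loop count stand in for your -s/-c options.)

#define _GNU_SOURCE
#include <sched.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

struct cl {
        int owned;                      /* hand-off token ("lock") */
        char pad[64 - sizeof(int)];     /* keep each token in its own cacheline */
};

static void pin(int cpu)
{
        cpu_set_t set;

        CPU_ZERO(&set);
        CPU_SET(cpu, &set);
        sched_setaffinity(0, sizeof(set), &set);
}

static void ping_pong(struct cl *mine, struct cl *theirs, long loops)
{
        while (loops--) {
                /* spin until the peer hands my line over ... */
                while (!__atomic_load_n(&mine->owned, __ATOMIC_ACQUIRE))
                        ;
                __atomic_store_n(&mine->owned, 0, __ATOMIC_RELAXED);
                /* ... then dirty the other line and hand it back */
                __atomic_store_n(&theirs->owned, 1, __ATOMIC_RELEASE);
        }
}

int main(int argc, char **argv)
{
        long loops = argc > 1 ? atol(argv[1]) : 1000000;
        struct cl *line = mmap(NULL, 2 * sizeof(*line),
                               PROT_READ | PROT_WRITE,
                               MAP_SHARED | MAP_ANONYMOUS, -1, 0);

        line[0].owned = 1;              /* process A starts owning cacheline1 */

        if (fork() == 0) {              /* process B, e.g. bound to cpu 19 */
                pin(19);
                ping_pong(&line[1], &line[0], loops);
                _exit(0);
        }

        pin(1);                         /* process A, e.g. bound to cpu 1 */
        ping_pong(&line[0], &line[1], loops);
        wait(NULL);
        return 0;
}
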
>
> Table 1: normal perf
> #perf mem report --stdio -s cacheline,pid
>
>
> # Overhead Samples Cacheline Command: Pid
> # ........ ............ ....................... ....................
> #
> 47.61% 42257 [.] 0x0000000000000080 hitm_test3:146344
> 46.14% 42596 [.] 0000000000000000 hitm_test3:146343
> 2.16% 2074 [.] 0x0000000000003340 hitm_test3:146344
> 1.88% 1796 [.] 0x0000000000003340 hitm_test3:146343
> 0.20% 140 [.] 0x00007ffff291ce00 hitm_test3:146344
> 0.18% 126 [.] 0x00007ffff291ce00 hitm_test3:146343
> 0.10% 1 [k] 0xffff88042f071500 swapper: 0
> 0.07% 1 [k] 0xffff88042ef747c0 watchdog/11: 62
> ...
>
> Ok, now I know the hottest cachelines. Not too bad. However, in order to
> determine cacheline contention, it would be nice to know the offsets into
> the cacheline to see if there is contention or not. Unfortunately, the way
> the sorting works here, all the hist_entry data was combined into each
> cacheline, so I lose my granularity...
>
> I can do:
>
> Table 2: normal perf
> #perf mem report --stdio -s cacheline,pid,dso_daddr,mem
>
>
> # Overhead Samples Cacheline Command: Pid Data Object Memory access
> # ........ ............ ....................... .................... .............................. ........................
> #
> 45.24% 42581 [.] 0000000000000000 hitm_test3:146343 SYSV00000000 (deleted) L1 hit
> 44.43% 42231 [.] 0x0000000000000080 hitm_test3:146344 SYSV00000000 (deleted) L1 hit
> 2.19% 13 [.] 0x0000000000000080 hitm_test3:146344 SYSV00000000 (deleted) Local RAM hit
> 2.16% 2074 [.] 0x0000000000003340 hitm_test3:146344 hitm_test3 L1 hit
> 1.88% 1796 [.] 0x0000000000003340 hitm_test3:146343 hitm_test3 L1 hit
> 1.00% 13 [.] 0x0000000000000080 hitm_test3:146344 SYSV00000000 (deleted) Remote Cache (1 hop) hit
> 0.91% 15 [.] 0000000000000000 hitm_test3:146343 SYSV00000000 (deleted) Remote Cache (1 hop) hit
> 0.20% 140 [.] 0x00007ffff291ce00 hitm_test3:146344 [stack] L1 hit
> 0.18% 126 [.] 0x00007ffff291ce00 hitm_test3:146343 [stack] L1 hit
>
> Now I have some granularity (though the program keeps hitting the same
> offset in the cacheline) and some different levels of memory operations.
> Seems like a step forward. However, the cacheline is broken up a little
> bit (see how 0x0000000000000080 is split up three ways).
>
> I can now see where the cache contention is but I don't know how prevalent
> it is (what percentage of the cacheline is under contention). No need to
> waste time with cachelines that have little or no contention.
>
> Hmm, what if I used the -F option to group all the cachelines and their
> offsets together.
>
> Table 3: perf with -F
> #perf mem report --stdio -s cacheline,pid,dso_daddr,mem -i don.data -F cacheline,pid,dso_daddr,mem,overhead,sample|grep 0000000000000
> [k] 0000000000000000 swapper: 0 [kernel.kallsyms] Uncached hit 0.00% 1
> [k] 0000000000000000 kipmi0: 1500 [kernel.kallsyms] Uncached hit 0.02% 1
> [.] 0000000000000000 hitm_test3:146343 SYSV00000000 (deleted) L1 hit 45.24% 42581
> [.] 0000000000000000 hitm_test3:146343 SYSV00000000 (deleted) Remote Cache (1 hop) hit 0.91% 15
> [.] 0x0000000000000080 hitm_test3:146344 SYSV00000000 (deleted) L1 hit 44.43% 42231
> [.] 0x0000000000000080 hitm_test3:146344 SYSV00000000 (deleted) Local RAM hit 2.19% 13
> [.] 0x0000000000000080 hitm_test3:146344 SYSV00000000 (deleted) Remote Cache (1 hop) hit 1.00% 13
>
> Now I have the ability to see the whole cacheline easily and can probably
> roughly calculate the contention in my head. Of course, some prior
> knowledge was needed to get this info (like knowing which cacheline is
> interesting, from Table 1).
>
> Of course, our c2c tool was trying to make the output more readable and
> more obvious such that the user didn't have to know what to look for.
>
> Internally, our tool sorts similarly to Table 2, but then re-sorts onto a
> new rbtree of struct c2c_hit entries based on the hottest cachelines. From
> this new rbtree we can print our analysis easily.
>
> This new rbtree is slightly different from the -F output in that we
> 'group' cacheline entries together and re-sort that group. The -F option
> just re-sorts the sorted hist_entries and has no concept of grouping.
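
So, to check that I understand the grouping, something along these lines?
(A rough sketch only: struct c2c_hit is your name from above, but the field
names and helpers are made up, assuming perf's usual rbtree helpers and
zalloc().  It just aggregates the already-sorted hist entries per cacheline
so the groups can then be re-sorted by total weight.)

/* hypothetical field names, for illustration only */
struct c2c_hit {
        struct rb_node  rb_node;        /* keyed by cacheline address */
        u64             cacheline;      /* CLADRS() of the grouped entries */
        u64             period;         /* total weight, used for the re-sort */
        u32             nr_entries;     /* how many hist entries fell in here */
};

static struct c2c_hit *c2c_hit__find_or_new(struct rb_root *root, u64 cacheline)
{
        struct rb_node **p = &root->rb_node;
        struct rb_node *parent = NULL;
        struct c2c_hit *h;

        while (*p) {
                parent = *p;
                h = rb_entry(parent, struct c2c_hit, rb_node);
                if (cacheline < h->cacheline)
                        p = &(*p)->rb_left;
                else if (cacheline > h->cacheline)
                        p = &(*p)->rb_right;
                else
                        return h;
        }

        h = zalloc(sizeof(*h));
        if (h) {
                h->cacheline = cacheline;
                rb_link_node(&h->rb_node, parent, p);
                rb_insert_color(&h->rb_node, root);
        }
        return h;
}

/* group the already-sorted hist entries by cacheline */
static void c2c_hits__group(struct hists *hists, struct rb_root *cl_root)
{
        struct rb_node *nd;

        for (nd = rb_first(&hists->entries); nd; nd = rb_next(nd)) {
                struct hist_entry *he = rb_entry(nd, struct hist_entry, rb_node);
                struct c2c_hit *h;

                if (!he->mem_info)
                        continue;

                h = c2c_hit__find_or_new(cl_root,
                                         CLADRS(he->mem_info->daddr.al_addr));
                if (!h)
                        continue;
                h->period += he->stat.period;
                h->nr_entries++;
        }
}

A second tree sorted by ->period would then give the per-cacheline ordering
for printing.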
>
>
>
>
> We would prefer to have a 'group' sorting concept, as we believe that is
> the easiest way to organize the data. But I don't know if that can be
> incorporated into the 'perf' tool itself, or whether we should just keep
> that concept local to our flavor of the perf subcommand.
>
> I am hoping this semi-concocted example gives a better picture of the
> problem I am trying to wrestle with.

Yep, I understand your problem.

And I think it would be good to have the group sorting concept in the perf
tools for general use.  But it conflicts with the proposed change to the -F
option when non-sort keys are used with -s or -F.  So it needs more
thinking..

Unfortunately I'll be busy until the end of next week, so I'll only be able
to discuss and work on it after that.

Thanks,
Namhyung
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/