Re: [PATCH 0/4] perf: memory load/store events generalization

From: Stephane Eranian
Date: Fri Jul 22 2011 - 14:55:49 EST


Lin,

On Mon, Jul 4, 2011 at 1:02 AM, Lin Ming <ming.m.lin@xxxxxxxxx> wrote:
> Hi, all
>
> Intel PMU provides 2 facilities to monitor memory operations: load latency and precise store.
> This patchset tries to generalize memory load/store events,
> so that other arches may also add such features.
>
> A new sub-command "mem" is added,
>
> $ perf mem
>
>  usage: perf mem [<options>] {record <command> |report}
>
>     -t, --type <type>     memory operations(load/store)
>     -L, --latency <n>     latency to sample(only for load op)
>
That looks okay as a first-pass tool. But what people are most often
interested in is seeing where the misses occur, i.e., you need to
display load/store addresses somehow, especially for the more costly
misses (the ones the compiler cannot really hide by hoisting loads).
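
To make the idea concrete, here is a rough Python sketch (not perf
code) that groups samples by data address and keeps only the costly
ones. The sample tuples, the function name costly_miss_addresses and
the min_latency threshold are all hypothetical, made up for
illustration; they are not output or APIs of the patchset.

```python
from collections import defaultdict

# Hypothetical sample records: (data_address, latency_in_cycles).
# These values are invented for illustration; they are not perf output.
samples = [
    (0x7F3A10002040, 12),
    (0x7F3A10002040, 310),
    (0x7F3A10008100, 295),
    (0x7F3A10002040, 280),
    (0x7F3A1000A000, 9),
]


def costly_miss_addresses(samples, min_latency):
    """Group samples at or above min_latency by data address.

    Returns a list of (address, count, total_latency) tuples sorted by
    total latency, highest first.
    """
    per_addr = defaultdict(lambda: [0, 0])  # address -> [count, total latency]
    for addr, latency in samples:
        if latency >= min_latency:
            per_addr[addr][0] += 1
            per_addr[addr][1] += latency
    rows = [(addr, count, total) for addr, (count, total) in per_addr.items()]
    rows.sort(key=lambda row: row[2], reverse=True)
    return rows


for addr, count, total in costly_miss_addresses(samples, min_latency=100):
    print(f"{addr:#018x}  count={count:4d}  total latency={total:6d}")
```

A real report would presumably also want to resolve each address to a
symbol or mapping, which this sketch does not attempt.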

> $ perf mem -t load record make -j8
>
> <building kernel ..., monitoring memory load operation>
>
> $ perf mem -t load report
>
> Memory load operation statistics
> ================================
>                       L1-local: total latency=   28027, count=    3355(avg=8)

That's wrong. On Intel, you need to subtract 4 cycles from the latency
you get out of PEBS-LL. The kernel can do that.
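
As a rough illustration of the adjustment (a Python sketch, not kernel
code), using the L1-local numbers quoted above. The names
PEBS_LL_OVERHEAD_CYCLES, adjust_latency and adjusted_average are
hypothetical, and the aggregate form assumes every sample is at least
4 cycles long.

```python
# Fixed offset to remove from each PEBS-LL latency value, per the
# comment above. The constant name is an assumption for this sketch.
PEBS_LL_OVERHEAD_CYCLES = 4


def adjust_latency(raw_cycles, overhead=PEBS_LL_OVERHEAD_CYCLES):
    """Return one sample's latency with the fixed offset removed, clamped at 0."""
    return max(raw_cycles - overhead, 0)


def adjusted_average(total_latency, count, overhead=PEBS_LL_OVERHEAD_CYCLES):
    """Integer average after removing the offset from every sample.

    Assumes each of the `count` samples was at least `overhead` cycles,
    so the offset can be subtracted from the total in one step.
    """
    if count == 0:
        return 0
    return (total_latency - overhead * count) // count


# L1-local line from the report: total latency=28027, count=3355.
print(adjusted_average(28027, 3355))
```

Doing the subtraction per sample in the kernel, rather than on the
totals in the tool, would avoid the assumption about minimum latency.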

>                       L2-snoop: total latency=    1430, count=      29(avg=49)

I suspect L2-snoop is not correct. If this line item relates to bit 2 of
the data source, then it corresponds to a secondary miss. That means
you have a load to a cache-line that is already being requested.

>                       L2-local: total latency=     124, count=       8(avg=15)
>              L3-snoop, found M: total latency=     452, count=       4(avg=113)
>           L3-snoop, found no M: total latency=       0, count=       0(avg=0)
> L3-snoop, no coherency actions: total latency=     875, count=      18(avg=48)
>         L3-miss, snoop, shared: total latency=       0, count=       0(avg=0)
>      L3-miss, local, exclusive: total latency=       0, count=       0(avg=0)
>         L3-miss, local, shared: total latency=       0, count=       0(avg=0)
>     L3-miss, remote, exclusive: total latency=       0, count=       0(avg=0)
>        L3-miss, remote, shared: total latency=       0, count=       0(avg=0)
>                     Unknown L3: total latency=       0, count=       0(avg=0)
>                             IO: total latency=       0, count=       0(avg=0)
>                       Uncached: total latency=     464, count=      30(avg=15)
>
I think it would be more useful to print the % of loads captured for
each category.
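
For example, the percentages could be derived from the counts in the
report above roughly like this. This is a Python sketch only; the
load_counts dictionary simply restates the quoted counts, and the
function name percentages is hypothetical.

```python
# Sample counts per category, copied from the quoted load report.
load_counts = {
    "L1-local": 3355,
    "L2-snoop": 29,
    "L2-local": 8,
    "L3-snoop, found M": 4,
    "L3-snoop, found no M": 0,
    "L3-snoop, no coherency actions": 18,
    "L3-miss, snoop, shared": 0,
    "L3-miss, local, exclusive": 0,
    "L3-miss, local, shared": 0,
    "L3-miss, remote, exclusive": 0,
    "L3-miss, remote, shared": 0,
    "Unknown L3": 0,
    "IO": 0,
    "Uncached": 30,
}


def percentages(counts):
    """Map each category to its share of all captured samples, in percent."""
    total = sum(counts.values())
    if total == 0:
        return {name: 0.0 for name in counts}
    return {name: 100.0 * n / total for name, n in counts.items()}


for name, pct in percentages(load_counts).items():
    print(f"{name:>30}: {pct:6.2f}%")
```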

> $ perf mem -t store record make -j8
>
> <building kernel ..., monitoring memory store operation>
>
> $ perf mem -t store report
>
> Memory store operation statistics
> =================================
>                data-cache hit:     8138
>               data-cache miss:        0
>                      STLB hit:     8138
>                     STLB miss:        0
>                 Locked access:        0
>               Unlocked access:     8138
>
> Any comment is appreciated.
>
> Thanks,
> Lin Ming
>