Re: [perfmon2] IV.3 - AMD IBS

From: stephane eranian
Date: Thu Jun 25 2009 - 07:29:05 EST

Hi,

On Tue, Jun 23, 2009 at 4:55 PM, Ingo Molnar<mingo@xxxxxxx> wrote:
>
> The 20 bits delay is in cycles, right? So this in itself still lends
> itself to be transparently provided as a PERF_COUNT_HW_CPU_CYCLES
> counter.
>

I do not believe you can use IBS as a better substitute for either CYCLES or
INSTRUCTIONS sampling. IBS simply does not operate in the same way.

But instead of me arguing with you guys for a long time, I have asked someone
at AMD who knows more than I do about IBS. Paul posted his answer only on
the perfmon2 mailing list, so I have forwarded it below.

You will also note that he provides another example of why support for
software sampling period randomization is useful.

I would like to thank Paul for spending time providing a lot of useful details
about IBS.

I am hoping this can clarify things.

On Wed, Jun 24, 2009 at 8:20 PM, Drongowski, Paul <paul.drongowski@xxxxxxx> wrote:
>
> Hi --
>
> I'm sorry to be joining this discussion so late. A few of my
> colleagues pointed me toward the current thread on IBS and I've tried
> to catch up by reading the archives. A short self-introduction: I'm a
> member of the AMD CodeAnalyst team. Ravi Bhargava and I wrote
> Appendix G (concerning IBS) of the AMD Software Optimization Guide for
> AMD Family 10h Processors, and at one point in my life I worked on
> DCPI (using ProfileMe).
>
> First off, Stephane and Rob have done a good job representing IBS and
> also ProfileMe. Thanks, guys!
>
> Rather than grossly disturb the current discussion, I'd like to offer
> a few points of clarification and maybe a little useful history.
>
> Peter's observation that IBS is a "mismatch with the traditional one
> value per counter thing" is quite apt. IBS has similarities to
> ProfileMe. Stephane's citation of the Itanium Data-EAR and
> Instruction-EAR is also very relevant: both are examples of profile
> data that do not fit with the "one value per counter thing."
>
> IBS Fetch.
>
>   IBS fetch sampling does not exactly sample x86 instructions. The
>   current fetch counter counts fetch operations where a fetch
>   operation may be a 32-byte fetch block (on AMD Family 10h) or it
>   may be a fetch operation initiated by a redirection such as a
>   branch. A fetch block is 32 bytes of instruction information which
>   is sent to the instruction decoder. The fetch address that is
>   reported may either be the start of a valid x86 instruction or the
>   start of a fetch block. In the second case, the address may be in
>   the middle of an x86 instruction.
>
>   IBS fetch sampling produces a number of event flags (e.g.,
>   instruction cache miss), but it also produces the latency (in
>   cycles) of the fetch operation. The latencies can be accumulated in
>   either descriptive statistics, or better, in a histogram since
>   descriptive statistics don't really show where an access is hitting
>   in the memory hierarchy. BTW, even though an IBS fetch sample may be
>   reported, the decoder may not use the instruction bytes due to a
>   late arriving redirection.
>
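
A minimal Python sketch of the histogram point above. The latency values
and bucket edges are made up for illustration and do not come from real
IBS data; latency_histogram() is a hypothetical helper, not part of any
existing tool.

```python
from collections import Counter
from statistics import mean


def latency_histogram(latencies, bucket_edges):
    """Count latencies per bucket.

    Bucket i holds values below bucket_edges[i] (and at or above the
    previous edge); one extra open-ended bucket holds everything else.
    """
    hist = Counter()
    for lat in latencies:
        for i, edge in enumerate(bucket_edges):
            if lat < edge:
                hist[i] += 1
                break
        else:
            hist[len(bucket_edges)] += 1
    return [hist[i] for i in range(len(bucket_edges) + 1)]


# Made-up fetch latencies in cycles: mostly fast fetches, a few slow ones.
samples = [3] * 90 + [200] * 10
edges = [10, 50, 150]  # hypothetical bucket boundaries, in cycles

hist = latency_histogram(samples, edges)
avg = mean(samples)
```

Here the mean lands in a bucket that contains no samples at all, while
the histogram shows the two separate populations.
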
> IBS Op.
>
>   IBS op sampling does not sample x86 instructions. It samples the
>   ops which are issued from x86 instructions. Some x86 instructions
>   issue more than one op. Microcoded instructions are particularly
>   thorny as a single REP MOV may issue many ops, thereby affecting
>   the number of samples that fall on them (i.e., disproportionate to
>   the execution frequency of the surrounding basic block.) The number
>   of ops issued is data dependent and is unpredictable. Appendix C
>   of the Software Optimization Guide lists the number of ops issued
>   from x86 instructions (one, two or many).
>
>   Beginning with AMD Family 10h RevC, there are two op selection
>   (counting) modes for IBS: cycles-counting and dispatched op
>   counting.
>
>   Cycles-counting is _not_ equivalent to CPU_CLK_UNHALTED -- it is
>   not a precise version of the performance monitoring counter (PMC)
>   event (event select 0x076). In cycles-mode, when the current count
>   reaches the max count, the next available dispatch group of ops is
>   selected and a secondary mechanism selects an op within the dispatch
>   group. The dispatch group may contain one, two or three ops. If you
>   smell a rat, you're right. The secondary scheme negatively affects
>   the desired pseudo-random selection scheme. Also, if a dispatch
>   group is not available, the sample is skipped and the counting
>   process is reset.
>
>   Further, cycles-mode selection is affected by pipeline stalls. This
>   affects the distribution of IBS op samples taken in cycles-mode.
>   With cycles-mode, one instruction may have more data cache miss
>   events, but the underlying sampling basis is so skewed that the
>   comparison is not meaningful. IBS op samples are generated only for
>   ops that retire; tagged ops on a "wrong path" are flushed without
>   producing a sample. Overall, I cannot personally say that IBS
>   cycles-mode produces a precise equivalent to CPU_CLK_UNHALTED. I
>   cannot endorse or recommend its use in this way.
>
>   Given these issues, dispatched op counting was added in RevC. This
>   mode is the _preferred_ mode. Ops are counted as they are dispatched
>   and the op that triggers the max count threshold is selected and
>   tagged. Dispatched op mode produces a distribution of op samples
>   that reflects the execution frequency of instructions/basic blocks.
>   DirectPath Double and VectorPath (microcoded) x86 instructions which
>   issue more than one op will still be oversampled, however. The
>   distribution is important because it allows meaningful comparison of
>   event counts between instructions.
>
>   Even though the distribution of samples in dispatched op mode
>   reflects execution frequency, it is not a substitute for
>   RETIRED_INSTRUCTIONS (event select 0x0c0). The number of IBS op
>   samples in some workloads, especially those with certain kinds of
>   stack access and microcoded instructions, diverges greatly from
>   RETIRED_INSTRUCTIONS.
>
>   IBS is what it is.
>
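
A toy Python model of the dispatched op mode described above, showing
why instructions that issue several ops collect more samples than their
execution frequency would suggest. It is a sketch under stated
assumptions: the instruction names, op counts, and period are invented,
and the model ignores everything about the real pipeline except "count
dispatched ops and tag the one that hits the max count".

```python
from collections import Counter


def sample_dispatched_ops(program, iterations, period):
    """Tag every `period`-th dispatched op and credit the sample to the
    instruction that issued it."""
    samples = Counter()
    count = 0
    for _ in range(iterations):
        for name, n_ops in program:
            for _op in range(n_ops):
                count += 1
                if count == period:
                    samples[name] += 1
                    count = 0
    return samples


# Hypothetical basic block: every instruction executes once per
# iteration, but each issues a different number of ops.
block = [("single_op", 1), ("double_op", 2), ("microcoded", 13)]

samples = sample_dispatched_ops(block, iterations=7000, period=7)
```

All three instructions execute equally often, yet the sample counts
follow the number of ops each one issues (1 : 2 : 13 in this model),
which is why the op sample count cannot stand in for a retired
instruction count.
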
> IBS derived events
>
>   Since ProfileMe and Data EAR didn't exactly take the world by storm,
>   (oh, yeah, I worked with HP Caliper on Itanium for a while, too ;-),
>   profiling infrastructures like OProfile and CodeAnalyst are largely
>   based on the PMC sampling model.
>
>   In order to get IBS into practice as quickly as possible, we defined
>   IBS derived events. This allowed us to implement basic support for
>   IBS in both OProfile and CodeAnalyst without major changes in
>   infrastructure. I should note that translation from raw IBS bits to
>   derived events is and was always intended to be performed by user
>   space tools. I personally believe that translation should not be
>   performed in the kernel -- kernel support should be simple and
>   lightweight.
>
>   An IBS op sample is a small "packet" of profile data:
>
>       A bunch of event flags (data cache miss, etc.)
>       Tag-to-retire time (cycles)
>       Completion-to-retire (cycles)
>       DC miss latency (cycles)
>       DC miss addresses (64-bit virtual and physical addresses)
>
>   These entities can be used to compute latency distributions,
>   memory access maps, etc. IBS enables new kinds of analysis such
>   as data-centric profiling that identifies hot data regions (that
>   could be used to tune data layout in a NUMA environment).
>
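
As an illustration of the "packet" view and of data-centric profiling,
here is a minimal Python sketch. Everything in it is an assumption made
for the example: the IbsOpSample field names, the DC_MISS flag bit (not
the hardware encoding), the 4 KiB page size, and the sample values.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional

DC_MISS = 1 << 0   # hypothetical flag bit, not the hardware encoding
PAGE_SHIFT = 12    # assumes 4 KiB pages


@dataclass
class IbsOpSample:
    flags: int                     # event flags (data cache miss, etc.)
    tag_to_retire: int             # cycles
    comp_to_retire: int            # cycles
    dc_miss_latency: int           # cycles
    dc_linear_addr: Optional[int]  # virtual address, if valid
    dc_phys_addr: Optional[int]    # physical address, if valid


def hot_pages(samples):
    """Count data cache miss samples per virtual page, hottest first."""
    misses = Counter()
    for s in samples:
        if s.flags & DC_MISS and s.dc_linear_addr is not None:
            misses[s.dc_linear_addr >> PAGE_SHIFT] += 1
    return misses.most_common()


# Made-up samples: two misses on one page, one on another, one non-miss.
raw = [
    IbsOpSample(DC_MISS, 40, 5, 180, 0x7F0000001040, 0x1234040),
    IbsOpSample(DC_MISS, 35, 4, 210, 0x7F0000001F80, 0x1234F80),
    IbsOpSample(0, 12, 1, 0, None, None),
    IbsOpSample(DC_MISS, 50, 6, 95, 0x7F0000002010, 0x5678010),
]

hot = hot_pages(raw)
```

Because the whole packet is kept, the same samples could later be
regrouped by latency or by physical address without collecting them
again.
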
>   Quite frankly, at this juncture, I find the derived event model to
>   be too limiting. DCPI had a much different way of organizing
>   ProfileMe data that allowed flexible formulation of queries during
>   post-processing -- something that cannot be done with the derived
>   event approach.
>
>   Further, the organization and use of DC miss addresses is open for
>   investigation. I would _love_ to encourage someone (anyone? anyone?)
>   to take up this investigation. There may also be unforeseen uses --
>   perhaps driving compile-time optimizations. The existing derived
>   events do not adequately support new applications of IBS data. Thus,
>   I would encourage kernel-level support that passes IBS data along
>   without modification.
>
> Filtering.
>
>   After our initial experience with IBS, we see the need for
>   filtering. One approach is to collect and report only those IBS
>   register values that are needed to support a certain kind of
>   analysis. For example, if the DC miss addresses are not needed, why
>   collect them? Suravee and Robert Richter (both terrific colleagues)
>   have been investigating this, so I will defer to their analysis and
>   comments.
>
> Software randomization.
>
>   We've found that software randomization of the sampling period
>   and/or current count is needed to avoid certain situations where the
>   pipeline and the sampling process get into a periodic hard-loop that
>   affects the distribution of IBS op samples. BTW, forcing those low
>   order four bits to zero occasionally has a negative effect on op
>   distribution.
>
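
A toy Python model of the aliasing problem described above. It assumes a
loop body with a fixed number of ops and treats sampling as "advance by
one period, note which op of the loop was hit"; the loop size, base
period, and jitter ranges are invented, and real pipelines are far less
regular than this.

```python
import random


def sampled_positions(loop_ops, periods):
    """Return the set of loop positions hit when successive samples are
    `periods` ops apart inside a loop of `loop_ops` ops."""
    pos = 0
    hit = set()
    for p in periods:
        pos = (pos + p) % loop_ops
        hit.add(pos)
    return hit


rng = random.Random(1)  # fixed seed so the sketch is repeatable
LOOP_OPS = 16           # hypothetical loop body size, in ops
BASE = 4096             # hypothetical base sampling period, in ops
N = 1000                # number of samples to simulate

# Fixed period: every sample lands on the same op of the loop.
fixed = sampled_positions(LOOP_OPS, [BASE] * N)

# Randomized period, but low four bits forced to zero (multiples of 16).
coarse = sampled_positions(
    LOOP_OPS, [BASE + 16 * rng.randrange(16) for _ in range(N)])

# Randomized period including the low four bits.
fine = sampled_positions(
    LOOP_OPS, [BASE + rng.randrange(256) for _ in range(N)])
```

In this model both the fixed period and the randomized period with
zeroed low bits keep hitting a single op of the loop, while randomizing
the low bits as well spreads the samples over the loop body.
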
> IBS future extensions
>
>   Of course, I can't discuss specific new features. However, here are
>   some possible variations:
>
>       * The current count and max count values may become longer.
>       * New event flags may be added.
>       * Existing event flags may be left out (i.e., not implemented
>         in a family or model)
>       * New ancillary data (like DC miss latency or DC miss address)
>         may be added.
>
>   It may be necessary to collect new 64-bit values that do not contain
>   event flags, for example.
>
> Thanks for enduring this long-winded message. I hope that I've
> communicated some information and requirements, and I'll be more than
> happy to answer questions about IBS (or get the answers).
>
> -- pj
>
> Dr. Paul Drongowski
> AMD CodeAnalyst team
> Boston Design Center
>
> -------------------------
> The information presented in this reply is for informational purposes
> only and may contain technical inaccuracies, omissions and
> typographical errors. Links to third party sites are for convenience
> only, and no endorsement is implied.
