Re: [RFC 00/11] perf: Enhancing perf to export processor hazard information

From: Ravi Bangoria
Date: Wed Mar 04 2020 - 23:47:08 EST

Next message: Vivek Thampi: "Re: [PATCH RESEND] ptp: add VMware virtual PTP clock driver"
Previous message: Sumit Garg: "Re: [PATCH v2] MAINTAINERS: adjust to trusted keys subsystem creation"
In reply to: Kim Phillips: "Re: [RFC 00/11] perf: Enhancing perf to export processor hazard information"
Next in thread: Kim Phillips: "Re: [RFC 00/11] perf: Enhancing perf to export processor hazard information"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

Hi Kim,

Sorry for being a bit late.

On 3/3/20 3:55 AM, Kim Phillips wrote:

On 3/2/20 2:21 PM, Stephane Eranian wrote:

On Mon, Mar 2, 2020 at 2:13 AM Peter Zijlstra <peterz@xxxxxxxxxxxxx> wrote:

On Mon, Mar 02, 2020 at 10:53:44AM +0530, Ravi Bangoria wrote:

Modern processors export such hazard data in Performance
Monitoring Unit (PMU) registers. For example, the 'Sampled
Instruction Event Register' on IBM PowerPC[1][2] and
'Instruction-Based Sampling' on AMD[3] provide similar information.

Implementation detail:

A new sample_type called PERF_SAMPLE_PIPELINE_HAZ is introduced.
If it's set, the kernel converts arch-specific hazard information
into a generic format:

struct perf_pipeline_haz_data {
	/* Instruction/Opcode type: Load, Store, Branch .... */
	__u8	itype;
	/* Instruction Cache source */
	__u8	icache;
	/* Instruction suffered hazard in pipeline stage */
	__u8	hazard_stage;
	/* Hazard reason */
	__u8	hazard_reason;
	/* Instruction suffered stall in pipeline stage */
	__u8	stall_stage;
	/* Stall reason */
	__u8	stall_reason;
	__u16	pad;
};
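
To make the layout above concrete, here is a rough userspace sketch of
how a tool might copy one such record out of a raw sample buffer. It is
only an illustration under assumptions: the struct is redeclared locally
with <stdint.h> types instead of __u8/__u16, and the haz_itype values
and the haz_itype_name()/haz_decode() helpers are hypothetical names,
not part of this RFC or of any existing perf API.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Local copy of the layout proposed in this RFC (not a mainline uapi type). */
struct perf_pipeline_haz_data {
	uint8_t  itype;          /* instruction/opcode type */
	uint8_t  icache;         /* instruction cache source */
	uint8_t  hazard_stage;   /* pipeline stage where the hazard occurred */
	uint8_t  hazard_reason;  /* hazard reason */
	uint8_t  stall_stage;    /* pipeline stage where the stall occurred */
	uint8_t  stall_reason;   /* stall reason */
	uint16_t pad;            /* pads the record to 8 bytes */
};

/* Six u8 fields plus one u16 give a single 8-byte record. */
_Static_assert(sizeof(struct perf_pipeline_haz_data) == 8,
	       "record is expected to be 8 bytes");

/* Hypothetical itype encoding, for illustration only. */
enum haz_itype {
	HAZ_ITYPE_UNKNOWN = 0,
	HAZ_ITYPE_LOAD    = 1,
	HAZ_ITYPE_STORE   = 2,
	HAZ_ITYPE_BRANCH  = 3,
};

/* Map the (assumed) itype encoding to a printable name. */
static const char *haz_itype_name(uint8_t itype)
{
	switch (itype) {
	case HAZ_ITYPE_LOAD:   return "load";
	case HAZ_ITYPE_STORE:  return "store";
	case HAZ_ITYPE_BRANCH: return "branch";
	default:               return "unknown";
	}
}

/*
 * Copy one record out of a raw buffer.
 * Returns 0 on success, -1 if the buffer is too short.
 */
static int haz_decode(const void *buf, size_t len,
		      struct perf_pipeline_haz_data *out)
{
	assert(buf && out);
	if (len < sizeof(*out))
		return -1;
	memcpy(out, buf, sizeof(*out));
	return 0;
}
```

memcpy() is used rather than a pointer cast so the sketch does not
depend on the alignment of the buffer.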

Kim, does this format indeed work for AMD IBS?

It's not really 1:1: we don't have these separations of stages
and reasons. For example, we have 'missed in L2 cache'.
So IBS output is flatter, with more cycle latency figures than
IBM's, AFAICT.

AMD IBS captures pipeline latency data in the case of Fetch sampling,
such as fetch latency, tag-to-retire latency, completion-to-retire
latency and so on. Yes, Op sampling does provide more load/store-centric
information, but it also captures more detailed data for branch
instructions. We also looked at ARM SPE, which likewise captures more
detailed pipeline data and latency information.

Personally, I don't like the term hazard. This is too IBM Power
specific. We need to find a better term, maybe stall or penalty.

Right, IBS doesn't have a filter to only count stalled or otherwise
bad events. IBS' PPR descriptions have one occurrence of the
word stall, and none of penalty. The way I read IBS is it's just
reporting more sample data than just the precise IP: things like
hits, misses, cycle latencies, addresses, types, etc., so words
like 'extended', or even the 'auxiliary' already used today,
are more appropriate for IBS, although I'm the last person to
bikeshed.

We are thinking of using the word "pipeline" instead of "hazard".

Also worth considering is the support of ARM SPE (Statistical
Profiling Extension) which is their version of IBS.
Whatever gets added needs to cover all three with no limitations.

I thought Intel's various LBR, PEBS, and PT supported providing
similar sample data in perf already, like with perf mem/c2c?

perf-mem is more data-centric in my opinion; it is geared towards
memory profiling. The proposal here is to expose pipeline-related
details like stalls and latencies.

Thanks for the review,
Ravi
