Re: [PATCH v7 07/18] perf: add generic memory sampling interface

From: Benjamin Herrenschmidt
Date: Fri Feb 15 2013 - 21:46:13 EST


On Fri, 2013-02-15 at 11:46 -0800, Sukadev Bhattiprolu wrote:
>
> POWER could use an additional field:
>
> mem_deratmiss:1

If you want to make that field more "generic", make it "lvl1_tlb_miss",
i.e., a miss in the internal "level 1" TLB, which is the smallest/fastest
TLB level in the load/store unit.
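
As a rough sketch of how such a bit could sit next to the existing fields:
the field widths below are inferred from the PERF_MEM_*_SHIFT values in the
quoted patch (0, 5, 19, 24, 26). The union name demo_mem_data_src, the
helper, and the choice to carve the bit out of the spare bits are all
assumptions for illustration, not the layout the patch actually uses.

```c
#include <stdint.h>

/*
 * Illustration only: demo_mem_data_src is a hypothetical name.
 * Field widths are inferred from the PERF_MEM_*_SHIFT values in the
 * patch (0, 5, 19, 24, 26); lvl1_tlb_miss is taken from the spare bits.
 */
union demo_mem_data_src {
	uint64_t val;
	struct {
		uint64_t mem_op:5,        /* type of opcode */
			 mem_lvl:14,      /* memory hierarchy level */
			 mem_snoop:5,     /* snoop mode */
			 mem_lock:2,      /* locked instruction */
			 mem_dtlb:7,      /* TLB access */
			 lvl1_tlb_miss:1, /* missed the first-level TLB */
			 mem_rsvd:30;     /* remaining spare bits */
	};
};

/* Hypothetical helper: record a first-level TLB miss for one sample. */
static inline union demo_mem_data_src demo_mark_lvl1_tlb_miss(void)
{
	union demo_mem_data_src src = { .val = 0 };

	src.lvl1_tlb_miss = 1;
	return src;
}
```

The exact bit position of a bitfield member is compiler and endianness
dependent, so a real ABI would likely want an explicit shift as well.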

> AFAICT, POWER does not currently save the mem_op, snoop or lock info
> for the sampled instruction. I guess we can leave them set to 0.

Well, we don't have lock instructions to begin with :-) If we can read
the IP then we can deduce the memop tho.
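
A minimal sketch of deducing the memop from the instruction word, assuming
the 32-bit instruction at the sampled IP has already been fetched. The
helper name and the DEMO_* constants are hypothetical (the values mirror
PERF_MEM_OP_* from the patch), and only the D-form loads and stores with
primary opcodes 32..55 are covered; a complete decoder would also need the
X-form (primary opcode 31) and DS-form (58, 62) encodings.

```c
#include <stdint.h>

/* Values copied from the PERF_MEM_OP_* definitions in the patch. */
#define DEMO_MEM_OP_NA    0x01
#define DEMO_MEM_OP_LOAD  0x02
#define DEMO_MEM_OP_STORE 0x04

/*
 * Hypothetical helper: classify a 32-bit PowerPC instruction word by its
 * primary opcode (top 6 bits). Anything outside the D-form load/store
 * range is reported as "not available" in this sketch.
 */
static inline unsigned int demo_memop_from_insn(uint32_t insn)
{
	unsigned int op = insn >> 26;

	switch (op) {
	case 32: case 33: case 34: case 35:	/* lwz, lwzu, lbz, lbzu */
	case 40: case 41: case 42: case 43:	/* lhz, lhzu, lha, lhau */
	case 46:				/* lmw */
	case 48: case 49: case 50: case 51:	/* lfs, lfsu, lfd, lfdu */
		return DEMO_MEM_OP_LOAD;
	case 36: case 37: case 38: case 39:	/* stw, stwu, stb, stbu */
	case 44: case 45:			/* sth, sthu */
	case 47:				/* stmw */
	case 52: case 53: case 54: case 55:	/* stfs, stfsu, stfd, stfdu */
		return DEMO_MEM_OP_STORE;
	default:
		return DEMO_MEM_OP_NA;
	}
}
```

For example, 0x80610000 (lwz r3,0(r1)) classifies as a load and
0x90610000 (stw r3,0(r1)) as a store.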

> > > +};
> > > +
> > > +/* type of opcode (load/store/prefetch,code) */
> > > +#define PERF_MEM_OP_NA 0x01 /* not available */
> > > +#define PERF_MEM_OP_LOAD 0x02 /* load instruction */
> > > +#define PERF_MEM_OP_STORE 0x04 /* store instruction */
> > > +#define PERF_MEM_OP_PFETCH 0x08 /* prefetch */
> > > +#define PERF_MEM_OP_EXEC 0x10 /* code (execution) */
> > > +#define PERF_MEM_OP_SHIFT 0
> > > +
> > > +/* memory hierarchy (memory level, hit or miss) */
> > > +#define PERF_MEM_LVL_NA 0x01 /* not available */
> > > +#define PERF_MEM_LVL_HIT 0x02 /* hit level */
> > > +#define PERF_MEM_LVL_MISS 0x04 /* miss level */
> > > +#define PERF_MEM_LVL_L1 0x08 /* L1 */
> > > +#define PERF_MEM_LVL_LFB 0x10 /* Line Fill Buffer */
> > > +#define PERF_MEM_LVL_L2 0x20 /* L2 hit */
> > > +#define PERF_MEM_LVL_L3 0x40 /* L3 hit */
> > > +#define PERF_MEM_LVL_LOC_RAM 0x80 /* Local DRAM */
> > > +#define PERF_MEM_LVL_REM_RAM1 0x100 /* Remote DRAM (1 hop) */
> > > +#define PERF_MEM_LVL_REM_RAM2 0x200 /* Remote DRAM (2 hops) */
> > > +#define PERF_MEM_LVL_REM_CCE1 0x400 /* Remote Cache (1 hop) */
> > > +#define PERF_MEM_LVL_REM_CCE2 0x800 /* Remote Cache (2 hops) */
> > > +#define PERF_MEM_LVL_IO 0x1000 /* I/O memory */
> > > +#define PERF_MEM_LVL_UNC 0x2000 /* Uncached memory */
> > > +#define PERF_MEM_LVL_SHIFT 5
>
> POWER saves the following information to describe where the data was
> loaded from after a Dcache or DTLB miss.
>
> FROM_L2
> FROM_L3
>
> FROM_L2.1_SHR From another L2 or L3 on same chip, shared
> FROM_L2.1_MOD From another L2 or L3 on same chip, modified
>
> FROM_L3.1_SHR From remote L2 or L3, shared
> FROM_L3.1_MOD From remote L2 or L3, modified
>
> FROM_RL2L3_SHR From remote L2 or L3, shared
> FROM_RL2L3_MOD From remote L2 or L3, modified
>
> FROM_DL2L3_SHR From distant L2 or L3, shared
> FROM_DL2L3_MOD From distant L2 or L3, modified
>
> POWER uses 4 bits and a running count for its (currently) 13 possible
> values.
>
> The macros in the patch use a separate bit for each level - is that to
> allow selecting more than one level at the same time? If so, we will
> need to reserve a few more bits to allow for Power's memory levels that
> don't map to the above.
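
To illustrate the mismatch between an encoded source code and the
one-bit-per-level scheme, here is a small sketch. The enum numbering is
made up for the example (it is not the hardware encoding), and the DEMO_*
values mirror PERF_MEM_LVL_* from the patch. Only the sources with an
obvious counterpart are translated; the rest fall back to "not available",
which is where extra bits would be needed.

```c
#include <stdint.h>

/* Values copied from the PERF_MEM_LVL_* definitions in the patch. */
#define DEMO_LVL_NA  0x01
#define DEMO_LVL_HIT 0x02
#define DEMO_LVL_L2  0x20
#define DEMO_LVL_L3  0x40

/* Hypothetical numbering for a few POWER data-source codes. */
enum demo_power_src {
	DEMO_FROM_L2 = 1,
	DEMO_FROM_L3,
	DEMO_FROM_L21_SHR,
	DEMO_FROM_L21_MOD,
};

/* Translate an encoded source into the one-hot level bits. */
static inline uint64_t demo_power_src_to_lvl(enum demo_power_src src)
{
	switch (src) {
	case DEMO_FROM_L2:
		return DEMO_LVL_HIT | DEMO_LVL_L2;
	case DEMO_FROM_L3:
		return DEMO_LVL_HIT | DEMO_LVL_L3;
	default:
		/* no matching level bit among the quoted definitions */
		return DEMO_LVL_NA;
	}
}
```
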
>
> > > +
> > > +/* snoop mode */
> > > +#define PERF_MEM_SNOOP_NA 0x01 /* not available */
> > > +#define PERF_MEM_SNOOP_NONE 0x02 /* no snoop */
> > > +#define PERF_MEM_SNOOP_HIT 0x04 /* snoop hit */
> > > +#define PERF_MEM_SNOOP_MISS 0x08 /* snoop miss */
> > > +#define PERF_MEM_SNOOP_HITM 0x10 /* snoop hit modified */
> > > +#define PERF_MEM_SNOOP_SHIFT 19
> > > +
> > > +/* locked instruction */
> > > +#define PERF_MEM_LOCK_NA 0x01 /* not available */
> > > +#define PERF_MEM_LOCK_LOCKED 0x02 /* locked transaction */
> > > +#define PERF_MEM_LOCK_SHIFT 24
> > > +
> > > +/* TLB access */
> > > +#define PERF_MEM_TLB_NA 0x01 /* not available */
> > > +#define PERF_MEM_TLB_HIT 0x02 /* hit level */
> > > +#define PERF_MEM_TLB_MISS 0x04 /* miss level */
> > > +#define PERF_MEM_TLB_L1 0x08 /* L1 */
> > > +#define PERF_MEM_TLB_L2 0x10 /* L2 */
> > > +#define PERF_MEM_TLB_WK 0x20 /* Hardware Walker*/
> > > +#define PERF_MEM_TLB_OS 0x40 /* OS fault handler */
> > > +#define PERF_MEM_TLB_SHIFT 26
>
> On POWER, like with the Dcache source above, we have 4 bits to describe
> where the DTLB was loaded from after a dTLB miss.
>
> We would probably need to allow more bits for the memory level of the
> dTLB load source.
>
> > > +
> > > +#define PERF_MEM_S(a, s) \
> > > + (((u64)PERF_MEM_##a##_##s) << PERF_MEM_##a##_SHIFT)
> > > +
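
For reference, a short sketch of how PERF_MEM_S() composes a full value
from the definitions quoted above, here for a load that hit in L1 with an
L1 TLB hit. The u64 typedef, the demo_* helpers and DEMO_LVL_MASK are
assumptions added so the fragment builds on its own; the mask width of 14
bits follows from the LVL and SNOOP shifts (5 and 19).

```c
#include <stdint.h>

typedef uint64_t u64;

/* Subset of the definitions quoted above. */
#define PERF_MEM_OP_LOAD	0x02
#define PERF_MEM_OP_SHIFT	0
#define PERF_MEM_LVL_HIT	0x02
#define PERF_MEM_LVL_L1		0x08
#define PERF_MEM_LVL_SHIFT	5
#define PERF_MEM_SNOOP_NONE	0x02
#define PERF_MEM_SNOOP_SHIFT	19
#define PERF_MEM_LOCK_NA	0x01
#define PERF_MEM_LOCK_SHIFT	24
#define PERF_MEM_TLB_HIT	0x02
#define PERF_MEM_TLB_L1		0x08
#define PERF_MEM_TLB_SHIFT	26

#define PERF_MEM_S(a, s) \
	(((u64)PERF_MEM_##a##_##s) << PERF_MEM_##a##_SHIFT)

/* Hypothetical mask: 14 level bits sit between shifts 5 and 19. */
#define DEMO_LVL_MASK	0x3fffULL

/* Build the value for "load, L1 hit, no snoop, lock n/a, L1 TLB hit". */
static inline u64 demo_l1_load_hit(void)
{
	return PERF_MEM_S(OP, LOAD) |
	       PERF_MEM_S(LVL, HIT) | PERF_MEM_S(LVL, L1) |
	       PERF_MEM_S(SNOOP, NONE) |
	       PERF_MEM_S(LOCK, NA) |
	       PERF_MEM_S(TLB, HIT) | PERF_MEM_S(TLB, L1);
}

/* Extract the memory-level bits back out of a composed value. */
static inline u64 demo_lvl_bits(u64 val)
{
	return (val >> PERF_MEM_LVL_SHIFT) & DEMO_LVL_MASK;
}
```

Because every attribute is a separate bit, HIT and L1 can be OR'ed into
the same field, which an encoded running count cannot express directly.
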
> >
> > Would be nice to get feedback from PowerPC folks to see how well
> > this matches their memory profiling hw capabilities?
> >
> > I suspect there's a lot of differences, but one can always hope
> > ...
> >
> > If there's some hope for unification we could at least shape it
> > in a way that they could pick up and extend.
>
> Thanks for Ccing.
>
> While on the topic of sampled instructions, POWER saves the following
> information (in addition to the above memory info) for sampled
> instructions.
>
> - whether the sampled instruction encountered a stall
> - the reasons for the stall.
> - whether the instruction was from hypervisor
> - whether there was a branch mispredict
> - thresholding information
>
> These are clubbed into an "event vector" that is saved for sampled
> instructions. We have been meaning to find ways to present that to
> user space. Are there plans to retrieve and present these too?

Ben.

