Re: [PATCH v4 2/4] perf arm-spe: Use SPE data source for neoverse cores

From: Ali Saidi
Date: Sun Apr 03 2022 - 16:33:45 EST




On Thu, 31 Mar 2022 12:44:3 +0000, Leo Yan wrote:
>
> On Thu, Mar 31, 2022 at 01:28:58PM +0100, German Gomez wrote:
> > Hi all,
> >
> > It seems I gave the Review tags a bit too early this time. Apologies for
> > the inconvenience. Indeed there was more interesting discussions to be
> > had :)
> >
> > (Probably best to remove by tags for the next re-spin)
>
> Now worries, German. Your review and testing are very helpful :)
>
> > On 29/03/2022 15:32, Ali Saidi wrote:
> > > [...]
> > >
> > >> I still think we should consider to extend the memory levels to
> > >> demonstrate clear momory hierarchy on Arm archs, I personally like the
> > >> definitions for "PEER_CORE", "LCL_CLSTR", "PEER_CLSTR" and "SYS_CACHE",
> > >> though these cache levels are not precise like L1/L2/L3 levels, they can
> > >> help us to map very well for the cache topology on Arm archs and without
> > >> any confusion. We could take this as an enhancement if you don't want
> > >> to bother the current patch set's upstreaming.
> > > I'd like to do this in a separate patch, but I have one other proposal. The
> > > Neoverse cores L2 is strictly inclusive of the L1, so even if it's in the L1,
> > > it's also in the L2. Given that the Graviton systems and afaik the Ampere
> > > systems don't have any cache between the L2 and the SLC, thus anything from
> > > PEER_CORE, LCL_CLSTR, or PEER_CLSTR would hit in the L2, perhaps we
> > > should just set L2 for these cases? German, are you good with this for now?
> >
> > Sorry for the delay. I'd like to also check this with someone. I'll try
> > to get back asap. In the meantime, if this approach is also OK with Leo,
> > I think it would be fine by me.
>
> Thanks for the checking internally. Let me just bring up my another
> thinking (sorry that my suggestion is float): another choice is we set
> ANY_CACHE as cache level if we are not certain the cache level, and
> extend snoop field to indicate the snooping logics, like:
>
> PERF_MEM_SNOOP_PEER_CORE
> PERF_MEM_SNOOP_LCL_CLSTR
> PERF_MEM_SNOOP_PEER_CLSTR
>
> Seems to me, we doing this is not only for cache level, it's more
> important for users to know the variant cost for involving different
> snooping logics.

I think we've come full circle :). Going back to what do we want to indicate to
a user about the source of the cache line, I believe there are three things with
an eye toward helping a user of the data improve the performance of their
application:

1. The level below them in the hierarchy it it (L1, L2, LLC, local DRAM).
Depending on the level this directly indicates the expense of the operation.

2. If it came from a peer of theirs on the same socket. I'm really of the
opinion still that exactly which peer, doesn't matter much as it's a 2nd or 3rd
order concern compared to, it it couldn't be sourced from a cache level below
the originating core, had to come from a local peer and the request went to
that lower levels and was eventually sourced from a peer. Why it was sourced
from the peer is still almost irrelevant to me. If it was truly modified or the
core it was sourced from only had permission to modify it the snoop filter
doesn't necessarily need to know the difference and the outcome is the same.

3. For multi-socket systems that it came from a different socket and there it is
probably most interesting if it came from DRAM on the remote socket or a cache.

I'm putting 3 aside for now since we've really been focusing on 1 and 2 in this
discussion and I think the biggest hangup has been the definition of HIT vs
HITM. If someone has a precise definition, that would be great, but AFAIK it
goes back to the P6 bus where HIT was asserted by another core if it had a line
(in any state) and HITM was additionally asserted if a core needed to inhibit
another device (e.g. DDR controller) from providing that line to the requestor.

The latter logic is why I think it's perfectly acceptable to use HITM to
indicate a peer cache-to-cache transfer, however since others don't feel that way
let me propose a single additional snooping type PERF_MEM_SNOOP_PEER that
indicates some peer of the hierarchy below the originating core sourced the
data. This clears up the definition that line came from from a peer and may or
may not have been modified, but it doesn't add a lot of implementation dependant
functionality into the SNOOP API.

We could use the mem-level to indicate the level of the cache hierarchy we had
to get to before the snoop traveled upward, which seems like what x86 is doing
here.

PEER_CORE -> MEM_SNOOP_PEER + L2
PEER_CLSTR -> MEM_SNOOP_PEER + L3
PEER_LCL_CLSTR -> MEM_SNOOP_PEER + L3 (since newer neoverse cores don't support
the clusters and the existing commercial implementations don't have them).

Thanks,
Ali