Re: [PATCH v2] perf c2c: Add report option to show false sharing in adjacent cachelines

From: Feng Tang
Date: Tue Feb 14 2023 - 20:10:15 EST


Hi Arnaldo,

On Tue, Feb 14, 2023 at 04:30:56PM -0300, Arnaldo Carvalho de Melo wrote:
> Em Tue, Feb 14, 2023 at 03:58:23PM +0800, Feng Tang escreveu:
> > Many platforms have an adjacent cacheline prefetch feature. When it
> > is enabled, data in RAM is handled in pairs of cachelines (2N and 2N+1):
> > if one is fetched into the cache, the other is likely to be fetched too.
> > This in effect doubles the cacheline size, so false sharing can also
> > happen between adjacent cachelines.
> >
> > 0Day has captured performance changes related to this [1], and some
> > commercial software explicitly aligns its hot global variables to 128
> > bytes (2 cache lines) to avoid this kind of extended false sharing.
> >
> > So add an option "-a" or "--double-cl" to c2c report to show false
> > sharing at double cache line granularity, which acts as if the
> > cacheline size were doubled. There is no change to c2c record. The
> > hardware events for shared cachelines are still per cacheline, and
> > this option just changes the granularity at which events are grouped
> > and displayed.
> >
> > In the c2c report below (will-it-scale's 'pagefault2' case on an old kernel):
> >
> > ----------------------------------------------------------------------
> > 26 31 2 0 0 0 0xffff888103ec6000
> > ----------------------------------------------------------------------
> > 35.48% 50.00% 0.00% 0.00% 0.00% 0x10 0 1 0xffffffff8133148b 1153 66 971 3748 74 [k] get_mem_cgroup_from_mm
> > 6.45% 0.00% 0.00% 0.00% 0.00% 0x10 0 1 0xffffffff813396e4 570 0 1531 879 75 [k] mem_cgroup_charge
> > 25.81% 50.00% 0.00% 0.00% 0.00% 0x54 0 1 0xffffffff81331472 949 70 593 3359 74 [k] get_mem_cgroup_from_mm
> > 19.35% 0.00% 0.00% 0.00% 0.00% 0x54 0 1 0xffffffff81339686 1352 0 1073 1022 74 [k] mem_cgroup_charge
> > 9.68% 0.00% 0.00% 0.00% 0.00% 0x54 0 1 0xffffffff813396d6 1401 0 863 768 74 [k] mem_cgroup_charge
> > 3.23% 0.00% 0.00% 0.00% 0.00% 0x54 0 1 0xffffffff81333106 618 0 804 11 9 [k] uncharge_batch
> >
> > The offsets 0x10 and 0x54 used to be displayed in 2 groups, and now
> > they are listed together to give users a hint of extended false sharing.
> >
> > [1]. https://lore.kernel.org/lkml/20201102091543.GM31092@shao2-debian/
> >
> > Signed-off-by: Feng Tang <feng.tang@xxxxxxxxx>
> > Reviewed-by: Andi Kleen <ak@xxxxxxxxxxxxxxx>
> > Reviewed-by: Leo Yan <leo.yan@xxxxxxxxxx>
> > Tested-by: Leo Yan <leo.yan@xxxxxxxxxx>
> > ---
> > Changelog:
> >
> > v1 -> v2:
> > * Refine comments and fix typos (Leo Yan)
> > * Add Reviewed-by and Tested-by (for Arm64) tags from Leo Yan
> > * Refine cmd description and commit log to avoid using
> > architecture specific name (Joe Mario)
> >
> > tools/perf/Documentation/perf-c2c.txt | 7 +++++++
> > tools/perf/builtin-c2c.c | 22 +++++++++++++---------
> > tools/perf/util/cacheline.h | 25 ++++++++++++++++++++-----
> > tools/perf/util/sort.c | 13 ++++++++++---
> > tools/perf/util/sort.h | 1 +
> > 5 files changed, 51 insertions(+), 17 deletions(-)
> >
> > diff --git a/tools/perf/Documentation/perf-c2c.txt b/tools/perf/Documentation/perf-c2c.txt
> > index 4e8c263e1721..cc18d61ec5d5 100644
> > --- a/tools/perf/Documentation/perf-c2c.txt
> > +++ b/tools/perf/Documentation/perf-c2c.txt
> > @@ -130,6 +130,13 @@ REPORT OPTIONS
> > The known limitations include exception handing such as
> > setjmp/longjmp will have calls/returns not match.
> >
> > +-a::
> > +--double-cl::
> > + Group the detection of shared cacheline events into double cacheline
> > + granularity. Some architectures have an Adjacent Cacheline Prefetch
> > + feature, which causes cacheline sharing to behave like the cacheline
> > + size is doubled.
> > +
>
> Humm, this is something not that usual, so I think we should have it
> just as --double-cl, ok?
>
> I can do the adjustment here if you agree.

Sure. Many thanks for the review and suggestion!
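
For anyone following the thread, here is a minimal sketch of what the
double cacheline grouping amounts to. It assumes a 64-byte cacheline, and
the helper names group_base() and group_offset() are made up for
illustration; this is not the code from the patch. Each sampled address is
rounded down either to its own cacheline or to the aligned pair of
cachelines (2N and 2N+1) that contains it:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper (not taken from the patch): return the base address
 * of the group an address belongs to. With double_cl set, the group is the
 * aligned pair of cachelines, i.e. twice the line size.
 */
static inline uint64_t group_base(uint64_t addr, uint64_t line_size,
				  bool double_cl)
{
	uint64_t size = double_cl ? line_size * 2 : line_size;

	/* The mask below is only valid for a power-of-two size. */
	assert(size != 0 && (size & (size - 1)) == 0);

	return addr & ~(size - 1);
}

/* Hypothetical helper: offset of the address within its group. */
static inline uint64_t group_offset(uint64_t addr, uint64_t line_size,
				    bool double_cl)
{
	return addr - group_base(addr, line_size, double_cl);
}
```

Under that 64-byte assumption, the two offsets from the quoted report
(0x10 and 0x54 from 0xffff888103ec6000) fall into groups starting at
0x...6000 and 0x...6040 when grouped per cacheline, and both fall into the
group starting at 0x...6000 when grouped per double cacheline, which is why
they now show up in one block.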

- Feng

> Thanks,
>
> - Arnaldo