Re: [6.1.7][6.2-rc5] perf all metrics test: FAILED!

From: Ian Rogers
Date: Mon Jan 30 2023 - 19:21:01 EST


On Mon, Jan 30, 2023 at 2:04 AM James Clark <james.clark@xxxxxxx> wrote:
>
>
>
> On 30/01/2023 02:24, Sedat Dilek wrote:
> > ?
> >
> > On Mon, Jan 30, 2023 at 12:21 AM Ian Rogers <irogers@xxxxxxxxxx> wrote:
> >>
> >> On Sun, Jan 29, 2023 at 1:59 AM Sedat Dilek <sedat.dilek@xxxxxxxxx> wrote:
> >>>
> >>> [ CC LLVM linux folks + Ben from Debian kernel team ]
> >>>
> >>> Hi,
> >>>
> >>> I am playing with LLVM version 16.0.0-rc1 which was released yesterday and PERF.
> >>>
> >>> After building my selfmade LLVM toolchain, I built perf and run some
> >>> perf tests here on my Intel SandyBridge CPU (details see below).
> >>>
> >>> perf all metrics test: FAILED!
> >>>
> >>> ...with both Debian's perf version 6.1.7 and my selfmade version 6.2-rc5.
> >>>
> >>> Just noticed:
> >>>
> >>> Couldn't bump rlimit(MEMLOCK), failures may take place when creating
> >>> BPF maps, etc
> >>>
> >>> Run the below tests with `sudo` - made this go away - still FAILED.
> >>>
> >>> But maybe I am missing to activate some sysfs/debug or whatever other stuff?
> >>
> >> Hi Sedat,
> >>
> >> things have been improving wrt metrics and so this failure may have
> >> just been because of the addition of a previously missing metric. The
> >> rlimit thing shouldn't affect things but maybe file descriptors?
> >> Looking at the test output the issue is:
> >>
> >> ```
> >> Metric 'tma_dram_bound' not printed in:
> >> # Running 'internals/synthesize' benchmark:
> >> Computing performance of single threaded perf event synthesis by
> >> synthesizing events on the perf process itself:
> >> Average synthesis took: 207.680 usec (+- 0.176 usec)
> >> Average num. events: 30.000 (+- 0.000)
> >> Average time per event 6.923 usec
> >> Average data synthesis took: 217.833 usec (+- 0.202 usec)
> >> Average num. events: 161.000 (+- 0.000)
> >> Average time per event 1.353 usec
> >>
> >> Performance counter stats for 'perf bench internals synthesize':
> >>
> >> <not counted> MEM_LOAD_UOPS_RETIRED.LLC_HIT
> >> (0,00%)
> >> <not counted> CYCLE_ACTIVITY.STALLS_L2_PENDING
> >> (0,00%)
> >> <not counted> CPU_CLK_UNHALTED.THREAD
> >> (0,00%)
> >> <not counted> MEM_LOAD_UOPS_MISC_RETIRED.LLC_MISS
> >> (0,00%)
> >> ```
> >>
> >> So the test was checking to see whether the tma_dram_bound metric
> >> could be computed on your Sandybridge and it failed. The event counts
> >> below show that every event came back "<not counted>" which is usually
> >> indicative of a permissions problem - it is also not surprising given
> >> this that the metric wasn't computed. You could try repeating the
> >> command the test is trying with something like "perf stat -M
> >> tma_dram_bound -a sleep 1", but running as root should have resolved
> >> that issue. Does that give you enough to keep exploring?
> >>
> >
> > Hi Ian,
> >
> > Thanks for your feedback!
> >
> > I booted into my Debian kernel - just to see what happens.
> >
> > # cat /proc/version
> > Linux version 6.1.0-2-amd64 (debian-kernel@xxxxxxxxxxxxxxxx) (gcc-12
> > (Debian 12.2.0-14) 12.2.0, GNU ld (GNU Binutils for Debian) 2.40) #1
> > SMP PREEMPT_DYNAMIC Debian 6.1.7-1 (2023-01-18)
> >
> > All things run as root...
> >
> > # echo 0 | tee /proc/sys/kernel/kptr_restrict
> > /proc/sys/kernel/perf_event_paranoid
> > 0
> >
> > # /usr/bin/perf test 10 92 98 99 100 101
> > 10: PMU events :
> > 10.1: PMU event table sanity : Ok
> > 10.2: PMU event map aliases : Ok
> > 10.3: Parsing of PMU event table metrics : Ok
> > 10.4: Parsing of PMU event table metrics with fake PMUs : Ok
> > 92: perf record tests : Ok
> > 98: perf stat tests : Ok
> > 99: perf all metricgroups test : Ok
> > 100: perf all metrics test : FAILED!
> > 101: perf all PMU test : Ok
> >
> > # perf stat -M tma_dram_bound -a sleep 1
> >
> > Performance counter stats for 'system wide':
> >
> > <not counted> MEM_LOAD_UOPS_RETIRED.LLC_HIT
> > (0,00%)
> > <not counted> CYCLE_ACTIVITY.STALLS_L2_PENDING
> > (0,00%)
> > <not counted> CPU_CLK_UNHALTED.THREAD
> > (0,00%)
> > <not counted> MEM_LOAD_UOPS_MISC_RETIRED.LLC_MISS
> > (0,00%)
> >
>
> Hi Sedat,
>
> I also had this failure and did a git bisect, but it led me to the
> conclusion that it is a stale build issue rather than a regression.
>
> There was a recent commit that renamed/removed some json PMU files which
> the build system can't cope with. I think the tests end up iterating
> over a different set of event names than were generated by the build system.
>
> If you do a clean build the issue should go away. I don't know if there
> is anything more we can do to stop this from happening.
>
> James

So I think this is a kernel bug triggering a perf tool bug. The kernel
bug can be worked around in the perf tool. I only had an Ivybridge to
test with (hence slightly different events) but what I see is both
tma_dram_bound and tma_l3_bound using the same 4 events. I could work
around the "<not counted>" by adding the --metric-no-group flag:

```
$ perf stat -M tma_l3_bound --metric-no-group -a sleep 1

Performance counter stats for 'system wide':

400,404 MEM_LOAD_UOPS_RETIRED.LLC_HIT # 4.3 %
tma_l3_bound (74.99%)
128,937,891 CYCLE_ACTIVITY.STALLS_L2_PENDING
(87.46%)
167,459 MEM_LOAD_UOPS_RETIRED.LLC_MISS
(74.99%)
759,574,967 CPU_CLK_UNHALTED.THREAD
(87.47%)

1.001526438 seconds time elapsed

$ perf stat -M tma_dram_bound -a --metric-no-group sleep 1

Performance counter stats for 'system wide':

259,954 MEM_LOAD_UOPS_RETIRED.LLC_HIT # 15.2 %
tma_dram_bound (74.99%)
118,807,043 CYCLE_ACTIVITY.STALLS_L2_PENDING
(87.46%)
111,699 MEM_LOAD_UOPS_RETIRED.LLC_MISS
(74.95%)
587,571,060 CPU_CLK_UNHALTED.THREAD
(87.45%)

1.001518093 seconds time elapsed
```

The issue is that perf metrics use weak groups of events. A weak group
is the same as a group of events initially. We want to use groups of
events with metrics so that all the counters are scheduled in and out
at the same time, and not multiplexed independently. Imagine measuring
IPC but the counts for instructions and cycles are measured at
different periods, the resultant IPC value would be unlikely to be
accurate. If perf_event_open fails then the perf tool retries the
events without the group. If I try just 3 of the events in a weak
group then the failure can be seen:

```
$ perf stat -e "{MEM_LOAD_UOPS_RETIRED.LLC_HIT,MEM_LOAD_UOPS_RETIRED.LLC_MISS,CYCLE_ACTIVITY.STALLS_L2_PENDING}:W"
-a sleep 1

Performance counter stats for 'system wide':

<not counted> MEM_LOAD_UOPS_RETIRED.LLC_HIT
(0.00%)
<not counted> MEM_LOAD_UOPS_RETIRED.LLC_MISS
(0.00%)
<not counted> CYCLE_ACTIVITY.STALLS_L2_PENDING
(0.00%)

1.001458485 seconds time elapsed
```

The kernel should have failed the perf_event_open on opening the third
event and then measured without the group, which it can do with
multiplexing as in the following:

```
$ perf stat -e "MEM_LOAD_UOPS_RETIRED.LLC_HIT,MEM_LOAD_UOPS_RETIRED.LLC_MISS,CYCLE_ACTIVITY.STALLS_L2_PENDING"
-a sleep 1

Performance counter stats for 'system wide':

1,239,397 MEM_LOAD_UOPS_RETIRED.LLC_HIT
(79.06%)
174,826 MEM_LOAD_UOPS_RETIRED.LLC_MISS
(64.60%)
124,026,024 CYCLE_ACTIVITY.STALLS_L2_PENDING
(81.16%)

1.001483434 seconds time elapsed
```

When the --metric-no-group flag is given to perf then it doesn't
produce the initial weak group, which works around the bug of the
kernel not failing on the 3rd perf_event_open. I've added Kan and
Zhengjun to the e-mail as they work on the Intel kernel PMU code.

There's a question about what we should do in the perf test about
this? I have a few solutions:

1) try metric tests again with the --metric-no-group flag and don't
fail the test if this succeeds. This allows kernel bugs to hide, so
I'm not a huge fan.

2) add a new metric flag/constraint to say not to group, this way the
metric will automatically apply the "--metric-no-group" flag. It is a
bit of work to wire this up but this kind of failure is common enough
in PMUs that it is probably worthwhile. We also need to add the flag
to metrics and I'm not sure how to get a good list of the metrics that
currently fail and require it. This is okay but error prone.

3) fix the kernel bug and let the perf test fail until an adequate
kernel is installed. Probably the best option.

Thanks,
Ian

> > 1,002148600 seconds time elapsed
> >
> > Hmm... looking at... Metric 'tma_l3_bound' ...
> >
> > Running...
> >
> > # perf stat --verbose -M tma_l3_bound -a sleep 1
> > Using CPUID GenuineIntel-6-2A-7
> > metric expr (MEM_LOAD_UOPS_RETIRED.LLC_HIT /
> > (MEM_LOAD_UOPS_RETIRED.LLC_HIT + 7 *
> > MEM_LOAD_UOPS_MISC_RETIRED.LLC_MISS)) *
> > CYCLE_ACTIVITY.STALLS_L2_PENDING / CLKS for tma_l3_bound
> > metric expr CPU_CLK_UNHALTED.THREAD for CLKS
> >
> > found event MEM_LOAD_UOPS_RETIRED.LLC_HIT
> > found event CYCLE_ACTIVITY.STALLS_L2_PENDING
> > found event CPU_CLK_UNHALTED.THREAD
> > found event MEM_LOAD_UOPS_MISC_RETIRED.LLC_MISS
> >
> > Parsing metric events
> > '{MEM_LOAD_UOPS_RETIRED.LLC_HIT/metric-id=MEM_LOAD_UOPS_RETIRED.LLC_HIT/,CYCLE_ACTIVITY.STALLS_L2_PENDING/metric-id=CYCLE_ACTIVITY.STALLS_L2_PEND
> > ING/,CPU_CLK_UNHALTED.THREAD/metric-id=CPU_CLK_UNHALTED.THREAD/,MEM_LOAD_UOPS_MISC_RETIRED.LLC_MISS/metric-id=MEM_LOAD_UOPS_MISC_RETIRED.LLC_MISS/}:W'
> > MEM_LOAD_UOPS_RETIRED.LLC_HIT -> cpu/event=0xd1,period=0xc365,umask=0x4/
> > CYCLE_ACTIVITY.STALLS_L2_PENDING ->
> > cpu/event=0xa3,cmask=0x5,period=0x1e8483,umask=0x5/
> > CPU_CLK_UNHALTED.THREAD -> cpu/event=0x3c,period=0x1e8483/
> > MEM_LOAD_UOPS_MISC_RETIRED.LLC_MISS -> cpu/event=0xd4,period=0x186a7,umask=0x2/
> >
> > Control descriptor is not initialized
> >
> > MEM_LOAD_UOPS_RETIRED.LLC_HIT: 0 4007421228 0
> > CYCLE_ACTIVITY.STALLS_L2_PENDING: 0 4007421228 0
> > CPU_CLK_UNHALTED.THREAD: 0 4007421228 0
> > MEM_LOAD_UOPS_MISC_RETIRED.LLC_MISS: 0 4007421228 0
> >
> > Performance counter stats for 'system wide':
> >
> > <not counted> MEM_LOAD_UOPS_RETIRED.LLC_HIT
> > (0,00%)
> > <not counted> CYCLE_ACTIVITY.STALLS_L2_PENDING
> > (0,00%)
> > <not counted> CPU_CLK_UNHALTED.THREAD
> > (0,00%)
> > <not counted> MEM_LOAD_UOPS_MISC_RETIRED.LLC_MISS
> > (0,00%)
> >
> > 1,002310013 seconds time elapsed
> >
> > So those events/metric-ids resulting in "<not counted>" are all found.
> >
> > What means "Control descriptor is not initialized"?
> >
> > To summarize:
> >
> > Those two tests in "100: perf all metrics test" FAILED:
> >
> > 1. tma_dram_bound
> > 2. tma_l3_bound
> >
> > Best regards,
> > -Sedat-
> >
> >> Thanks,
> >> Ian
> >>
> >>> Last perf version which was OK:
> >>>
> >>> ~/bin/perf -v
> >>> perf version 6.0.0
> >>>
> >>> echo "linux-perf: Adjust limited access to performance monitoring and
> >>> observability operations"
> >>> echo 0 | sudo tee /proc/sys/kernel/kptr_restrict
> >>> /proc/sys/kernel/perf_event_paranoid
> >>> 0
> >>>
> >>> ~/bin/perf test 10 86 92 93 94 95
> >>> 10: PMU events :
> >>> 10.1: PMU event table sanity : Ok
> >>> 10.2: PMU event map aliases : Ok
> >>> 10.3: Parsing of PMU event table metrics : Ok
> >>> 10.4: Parsing of PMU event table metrics with fake PMUs : Ok
> >>> 86: perf record tests : Ok
> >>> 92: perf stat tests : Ok
> >>> 93: perf all metricgroups test : Ok
> >>> 94: perf all metrics test : Ok
> >>> 95: perf all PMU test : Ok
> >>>
> >>> echo 1 | sudo tee /proc/sys/kernel/kptr_restrict
> >>> /proc/sys/kernel/perf_event_paranoid
> >>> echo "linux-perf: Reset limited access to performance monitoring and
> >>> observability operations"
> >>>
> >>> If you need further information, please let me know.
> >>>
> >>> Thanks.
> >>>
> >>> Regards,
> >>> -Sedat-
> >>>
> >>> P.S. Instructions
> >>>
> >>> [ REPRODUCER ]
> >>>
> >>> LLVM_MVER="16"
> >>>
> >>> # Debian LLVM
> >>> ##LLVM_TOOLCHAIN_PATH="/usr/lib/llvm-${LLVM_MVER}/bin"
> >>> # Selfmade LLVM
> >>> LLVM_TOOLCHAIN_PATH="/opt/llvm/bin"
> >>> if [ -d ${LLVM_TOOLCHAIN_PATH} ]; then
> >>> export PATH="${LLVM_TOOLCHAIN_PATH}:${PATH}"
> >>> fi
> >>>
> >>> PYTHON_VER="3.11"
> >>> MAKE="make"
> >>> MAKE_OPTS="V=1 -j1 HOSTCC=clang-$LLVM_MVER HOSTLD=ld.lld
> >>> HOSTAR=llvm-ar CC=clang-$LLVM_MVER LD=ld.lld AR=llvm-ar
> >>> STRIP=llvm-strip"
> >>>
> >>> echo "LLVM MVER ........ $LLVM_MVER"
> >>> echo "Path settings .... $PATH"
> >>> echo "Python version ... $PYTHON_VER"
> >>> echo "make line ........ $MAKE $MAKE_OPTS"
> >>>
> >>> LANG=C LC_ALL=C make -C tools/perf clean 2>&1 | tee ../make-log_perf-clean.txt
> >>>
> >>> LANG=C LC_ALL=C $MAKE $MAKE_OPTS -C tools/perf
> >>> PYTHON=python${PYTHON_VER} install-bin 2>&1 | tee
> >>> ../make-log_perf-install_bin_python${PYTHON_VER}_llvm${LLVM_MVER}.txt
> >>>
> >>>
> >>> [ TESTS ]
> >>>
> >>> [ TESTS - START ]
> >>>
> >>> echo 0 | sudo tee /proc/sys/kernel/kptr_restrict
> >>> /proc/sys/kernel/perf_event_paranoid
> >>>
> >>> [ TESTS - DEBIAN ]
> >>>
> >>> /usr/bin/perf -v
> >>> perf version 6.1.7
> >>>
> >>> /usr/bin/perf test 10 92 98 99 100 101
> >>>
> >>> 10: PMU events :
> >>> 10.1: PMU event table sanity : Ok
> >>> 10.2: PMU event map aliases : Ok
> >>> 10.3: Parsing of PMU event table metrics : Ok
> >>> 10.4: Parsing of PMU event table metrics with fake PMUs : Ok
> >>> 92: perf record tests : Ok
> >>> 98: perf stat tests : Ok
> >>> 99: perf all metricgroups test : Ok
> >>> 100: perf all metrics test : FAILED!
> >>> 101: perf all PMU test : Ok
> >>>
> >>> [ TESTS - DILEKS ]
> >>>
> >>> ~/bin/perf -v
> >>> perf version 6.2.0-rc5
> >>>
> >>> ~/bin/perf test 7 87 93 94 95 96
> >>>
> >>> 7: PMU events :
> >>> 7.1: PMU event table sanity : Ok
> >>> 7.2: PMU event map aliases : Ok
> >>> 7.3: Parsing of PMU event table metrics : Ok
> >>> 7.4: Parsing of PMU event table metrics with fake PMUs : Ok
> >>> 87: perf record tests : Ok
> >>> 93: perf stat tests : Ok
> >>> 94: perf all metricgroups test : Ok
> >>> 95: perf all metrics test : FAILED!
> >>> 96: perf all PMU test : Ok
> >>>
> >>> [ TESTS - FAILED ]
> >>>
> >>> /usr/bin/perf test --verbose 100 2>&1 | tee
> >>> perf-test-verbose-100-perf-all-metrics-test_debian-perf-6-1-7.txt
> >>>
> >>> ~/bin/perf test --verbose 95 2>&1 | tee
> >>> perf-test-verbose-95-perf-all-metrics-test_dileks-perf-6-2-rc5.txt
> >>>
> >>> [ TESTS - STOP ]
> >>>
> >>> echo 1 | sudo tee /proc/sys/kernel/kptr_restrict
> >>> /proc/sys/kernel/perf_event_paranoid
> >>>
> >>> - EOT -