Re: [RESEND PATCH 2/2] perf/x86: improve the event scheduling to avoid unnecessary pmu_stop/start

From: Stephane Eranian
Date: Mon Mar 07 2022 - 12:14:32 EST


On Sun, Mar 6, 2022 at 6:36 AM Wen Yang <wenyang@xxxxxxxxxxxxxxxxx> wrote:
>
>
>
> > On 2022/3/4 at 11:39 PM, Peter Zijlstra wrote:
> > On Fri, Mar 04, 2022 at 07:03:51PM +0800, Wen Yang wrote:
> >> This issue has been there for a long time; we can reproduce it as follows:
> >
> > What issue? You've not described an issue. So you cannot reference one.
> >
>
> OK, thank you for your opinion. Let's explain it.
>
>
> > This is still completely unreadable gibberish.
> >
> >> 1. Run a script that periodically collects perf data, e.g.:
> >> while true
> >> do
> >> perf stat -e cache-misses,cache-misses,cache-misses -C 1 sleep 2
> >> perf stat -e cache-misses -C 1 sleep 2
> >> sleep 1
> >> done
> >>
> >> 2. Run another one to capture the IPC, e.g.:
> >> perf stat -e cycles:D,instructions:D -C 1 -I 1000
> >
> > <snip line noise>
> >
> >> The reason is that the NMI watchdog permanently consumes one fixed
> >> counter (*cycles*). Therefore, when the above shell script requests
> >> *cycles* again, only a GP counter can be used, and its weight is 5.
> >> But other events (like *cache-misses*) have a weight of 4,
> >> so the counter used by *cycles* will often be taken away, as in
> >> the raw data below:
> >> [1]
> >> n_events = 3
> >> assign = {33, 1, 32, ...}
> >> -->
> >> n_events = 6
> >> assign = {33, 3, 32, 0, 1, 2, ...}
> >
> > Again unreadable... what do any of those numbers mean?
> >
>
> Scenario: a monitoring program watches the CPI on a specific CPU in
> pinned mode over a long period, such as the CPI collection in the
> original email:
> perf stat -e cycles:D,instructions:D -C 1 -I 1000
>
> Perf here is just an example. Our monitoring program continuously reads
> the counter values of these perf events (cycles and instructions).
>
> However, we found that the CPI data periodically becomes abnormal
> because of PMU counter conflicts. For example, the scripts in the
> e-mail cause interference:
> perf stat -e cache-misses,cache-misses,cache-misses -C 1 sleep 2
> perf stat -e cache-misses -C 1 sleep 2
>
> n_events = 3
> assign = {33, 1, 32, ...}
> -->
> n_events = 6
> assign = {33, 3, 32, 0, 1, 2, ...}
>
> These are fields of the cpu_hw_events structure:
>
> 	int n_events;			/* the # of events in the below arrays */
> 	int assign[X86_PMC_IDX_MAX];	/* event to counter assignment */
> 	struct perf_event *event_list[X86_PMC_IDX_MAX]; /* in enabled order */
>
> 32: INTEL_PMC_IDX_FIXED_INSTRUCTIONS
> 33: INTEL_PMC_IDX_FIXED_CPU_CYCLES
> 34: INTEL_PMC_IDX_FIXED_REF_CYCLES
> 0,1,2,3: general-purpose (GP) counters (see the header excerpt below)
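
For reference, those index values come from arch/x86/include/asm/perf_event.h
(abridged and annotated):

	#define INTEL_PMC_IDX_FIXED			32
	#define INTEL_PMC_IDX_FIXED_INSTRUCTIONS	(INTEL_PMC_IDX_FIXED + 0)	/* 32 */
	#define INTEL_PMC_IDX_FIXED_CPU_CYCLES		(INTEL_PMC_IDX_FIXED + 1)	/* 33 */
	#define INTEL_PMC_IDX_FIXED_REF_CYCLES		(INTEL_PMC_IDX_FIXED + 2)	/* 34 */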
>
>
> n_events = 3
> assign = {33, 1, 32, ...}
> event_list = {0xffff88bf77b85000, 0xffff88b72db82000,
> 0xffff88b72db85800, ...}
>
> -->
> This indicates that there are 3 perf events on CPU 1, occupying the
> #fixed_cpu_cycles, #1 and #fixed_instructions counters, respectively.
> These 3 perf events are generated by the NMI watchdog and script A:
> perf stat -e cycles:D,instructions:D -C 1 -I 1000
>
>
> n_events = 6
> assign = {33, 3, 32, 0, 1, 2, ...}
> event_list = {0xffff88bf77b85000, 0xffff88b72db82000,
> 0xffff88b72db85800, 0xffff88bf46c34000, 0xffff88bf46c35000,
> 0xffff88bf46c30000, ...}
>
> -->
> This indicates that there are 6 perf events on CPU 1, occupying the
> #fixed_cpu_cycles, #3, #fixed_instructions, #0, #1 and #2 counters,
> respectively.
> These 6 perf events are generated by the NMI watchdog and scripts A
> and B:
> perf stat -e cycles:D,instructions:D -C 1 -I 1000
> perf stat -e cache-misses,cache-misses,cache-misses -C 1 sleep 2
>
> 0xffff88bf77b85000:
> The perf event generated by the NMI watchdog, which has always occupied
> #fixed_cpu_cycles.
>
> 0xffff88b72db82000:
> The perf event generated by the above script A (cycles:D); the counter
> it uses changes from #1 to #3. We use the perf event in pinned mode and
> continuously read its value over a long period, but its PMU counter
> changes, so the value we read also jumps.
>
What you are describing here is working as expected. Pinning an event
DOES NOT mean pinning the event to a counter for the whole measurement.
It means the event will always be measured on "a" counter, but that
counter can change over time. Moving the event to a different counter
SHOULD NOT change its value, assuming you use the official API to read
the counter, i.e., the read() syscall, or rdpmc following the official
guidelines for it.
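
For example, a minimal sketch of that official read() path (illustrative
only, not from this thread; it assumes a pinned cycles event on CPU 1 and
sufficient perf_event_paranoid permissions):

	#include <stdio.h>
	#include <stdint.h>
	#include <string.h>
	#include <unistd.h>
	#include <sys/types.h>
	#include <sys/syscall.h>
	#include <linux/perf_event.h>

	/* there is no glibc wrapper for perf_event_open, so call the syscall directly */
	static int perf_event_open(struct perf_event_attr *attr, pid_t pid,
				   int cpu, int group_fd, unsigned long flags)
	{
		return syscall(SYS_perf_event_open, attr, pid, cpu, group_fd, flags);
	}

	int main(void)
	{
		struct perf_event_attr attr;
		uint64_t count;
		int fd;

		memset(&attr, 0, sizeof(attr));
		attr.size = sizeof(attr);
		attr.type = PERF_TYPE_HARDWARE;
		attr.config = PERF_COUNT_HW_CPU_CYCLES;
		attr.pinned = 1;	/* pinned to the PMU, not to one particular counter */

		/* count on CPU 1, all tasks */
		fd = perf_event_open(&attr, -1, 1, -1, 0);
		if (fd < 0)
			return 1;

		for (;;) {
			/* the kernel folds any counter moves into this accumulated value */
			if (read(fd, &count, sizeof(count)) != sizeof(count))
				break;
			printf("cycles: %llu\n", (unsigned long long)count);
			sleep(1);
		}
		return 0;
	}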

>
>
> 0xffff88b72db85800:
> The perf event generated by the above script A (instructions:D), which
> has always occupied #fixed_instructions.
>
> 0xffff88bf46c34000, 0xffff88bf46c35000, 0xffff88bf46c30000:
> These perf events are generated by the above script B.
>
>
> >>
> >> So it will cause unnecessary pmu_stop/start and also an abnormal CPI.
> >
> > How?!?
> >
>
> We can refer to the x86_pmu_enable() function:
> step 1: save events moving to new counters
> step 2: reprogram moved events into new counters
>
> and especially:
>
> static inline int match_prev_assignment(struct hw_perf_event *hwc,
>                                         struct cpu_hw_events *cpuc,
>                                         int i)
> {
>         return hwc->idx == cpuc->assign[i] &&
>                hwc->last_cpu == smp_processor_id() &&
>                hwc->last_tag == cpuc->tags[i];
> }
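
For context, the step 1/step 2 logic in x86_pmu_enable() (paraphrased and
trimmed from arch/x86/events/core.c, not a verbatim quote) looks roughly
like this:

	/* step1: save events moving to new counters */
	for (i = 0; i < n_running; i++) {
		event = cpuc->event_list[i];
		hwc = &event->hw;

		/*
		 * Reprogramming can be skipped only if the event got the
		 * same counter as last time, on the same CPU, and nothing
		 * else touched that counter in between.
		 */
		if (hwc->idx == -1 ||
		    match_prev_assignment(hwc, cpuc, i))
			continue;

		x86_pmu_stop(event, PERF_EF_UPDATE);
	}

	/* step2: reprogram moved events into new counters */
	for (i = 0; i < cpuc->n_events; i++) {
		event = cpuc->event_list[i];
		hwc = &event->hw;

		if (!match_prev_assignment(hwc, cpuc, i))
			x86_assign_hw_event(event, cpuc, i);
		else if (i < n_running)
			continue;

		x86_pmu_start(event, PERF_EF_RELOAD);
	}

So whenever the scheduler hands an already-running event a different
counter index than last time, match_prev_assignment() fails and the event
goes through an x86_pmu_stop()/x86_pmu_start() pair, which is the
reprogramming being discussed here.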
>
>
>
> >> Cloud servers usually monitor the CPI data of some important services
> >> continuously. This issue affects performance and misleads monitoring.
> >>
> >> The current event scheduling algorithm is more than 10 years old:
> >> commit 1da53e023029 ("perf_events, x86: Improve x86 event scheduling")
> >
> > irrelevant
> >
>
> commit 1da53e023029 ("perf_events, x86: Improve x86 event scheduling")
>
> This commit is the basis of the perf event scheduling algorithm we
> currently use.
> The reason why the counter above changed from #1 to #3 can be found in it:
> The algorithm takes into account the list of counter constraints
> for each event. It assigns events to counters from the most
> constrained, i.e., works on only one counter, to the least
> constrained, i.e., works on any counter.
>
> The NMI watchdog permanently consumes one fixed counter (*cycles*).
> Therefore, when the above shell script requests *cycles:D*
> again, it has to use a GP counter, and its weight is 5.
> But other events (like *cache-misses*) have a weight of 4,
> so the counter used by *cycles:D* will often be taken away
> (see the toy model sketched below).
>
And that's normal. Measuring cycles on a generic counter or a fixed
counter does not change what is counted or the precision of it.
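
To make the reassignment concrete, here is a toy model of the
weight-ordered assignment described above (illustrative only, not kernel
code; the GP-only constraint sets and the counter layout are assumptions
for this example):

	#include <stdio.h>

	#define NR_GP 4	/* GP counters 0..3; the fixed cycles counter is held by the NMI watchdog */

	struct ev { const char *name; int weight; };

	int main(void)
	{
		/*
		 * Events ordered by constraint weight (most constrained first),
		 * mirroring how the x86 scheduler walks them. The three
		 * cache-misses events (weight 4) are placed before cycles:D
		 * (weight 5), so they grab GP0..GP2 and cycles:D, which sat on
		 * GP1 in the previous schedule, ends up on GP3.
		 */
		struct ev evs[] = {
			{ "cache-misses", 4 },
			{ "cache-misses", 4 },
			{ "cache-misses", 4 },
			{ "cycles:D",     5 },
		};
		int used[NR_GP] = { 0 };

		for (unsigned int i = 0; i < sizeof(evs) / sizeof(evs[0]); i++) {
			for (int c = 0; c < NR_GP; c++) {
				if (!used[c]) {
					used[c] = 1;
					printf("%-14s (weight %d) -> GP%d\n",
					       evs[i].name, evs[i].weight, c);
					break;
				}
			}
		}
		return 0;
	}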

>
> In addition, we also found that this problem may affect the NMI watchdog
> in the production cluster.
> The NMI watchdog also uses a fixed counter, in pinned mode. Usually it is
> the first element of the event_list array, so it usually takes
> precedence and gets a fixed counter.


Not true. The NMI watchdog does not request a fixed counter. There is
no way to do this. It just lands on the fixed counter because the
scheduling algorithm favors fixed counters whenever possible.
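
For reference, the watchdog's event attribute (abridged; from
kernel/watchdog_hld.c in kernels of this vintage) only asks for pinned
cycles and never names a counter:

	static struct perf_event_attr wd_hw_attr = {
		.type		= PERF_TYPE_HARDWARE,
		.config		= PERF_COUNT_HW_CPU_CYCLES,
		.size		= sizeof(struct perf_event_attr),
		.pinned		= 1,
		.disabled	= 1,
	};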

>
> But if the administrator disables the watchdog first and then re-enables
> it, it may end up at the end of the event_list array, so its expected
> fixed counter may be occupied by another perf event, and it can only use
> a GP counter.
> In this way, there is a similar issue here: the PMU counter used by the
> NMI watchdog may be disabled/enabled frequently and unnecessarily.
>
Yes, and I don't think this is a problem.
You are trying to get a counter guarantee, but the interface DOES NOT
give this to users, and should not, in my opinion.

>
> Any advice or guidance on this would be appreciated.
>
> Best wishes,
> Wen
>
>
> >> We wish it could be optimized a bit.
> >
> > I wish for a unicorn ...
> >
> >> The fields msk_counters and msk_events are added to indicate currently
> >> used counters and events so that the used ones can be skipped
> >> in the __perf_sched_find_counter and perf_sched_next_event functions to avoid
> >> unnecessary pmu_stop/start.
> >
> > Still not sure what your actual problem is, nor what the actual proposal
> > is.
> >
> > Why should I attempt to reverse engineer your code without basic
> > understanding of what you're actually trying to achieve?
> >
>
>
>