[PATCH v3 00/46] Cache Monitoring Technology (aka CQM)

From: David Carrillo-Cisneros
Date: Sat Oct 29 2016 - 20:40:32 EST


This series introduces the next iteration of kernel support for the
Cache Monitoring Technology or CMT (formerly Cache QoS Monitoring, CQM)
available in Intel Xeon processors.

Documentation has replaced the Intel CQM name with Intel CMT.
This version renames all code to CMT.

It is rebased at tip x86/core to continue on top of
Fenghua Yu's Intel CAT series (partially merged).

One of the main limitations of the previous version is the inability
to simultaneously monitor:
1) An llc_occupancy CPU event and a cgroup or task llc_occupancy
event on that CPU.
2) cgroup events for cgroups in the same descendancy line.
3) cgroup events and any task event whose thread runs in a cgroup
in the same descendancy line.

Another limitation is that monitoring for a cgroup was enabled/disabled by
the existence of a perf event for that cgroup. Since the event
llc_occupancy measures changes in occupancy rather than total occupancy,
in order to read meaningful llc_occupancy values, an event should be
enabled for a long enough period of time. The context switch overhead
caused by the perf events is undesirable in some sensitive scenarios.

This series of patches addresses the shortcomings mentioned above and
adds some other improvements. The main changes are:
- No more potential conflicts between different events. The new
version builds a hierarchy of RMIDs that captures the dependency
between monitored cgroups. llc_occupancy for a cgroup is the sum of
llc_occupancies for that cgroup's RMID and all other RMIDs in the
cgroup's subtree (both monitored cgroups and threads).

- A cgroup integration that allows monitoring of a cgroup to start
without creating a perf event, decreasing the context switch
overhead. Monitoring is controlled by a semicolon-separated list
of flags passed to a perf cgroup attribute, e.g.:

echo "1;3;0;1" > cgroup_path/perf_event.cmt_monitoring

CPU packages 0, 1 and 3 have flags > 0, which marks those
packages to be monitored using RMIDs even if no perf_event is
attached to the cgroup. The meaning of other flag values is
explained in their own patches.

A perf_event is always required in order to read llc_occupancy.
This cgroup integration uses Intel's PQR code and is intended to
share code with Intel's upcoming CAT driver.

- A more stable rotation algorithm: The new algorithm explicitly
defines SLOs to guarantee that RMIDs are assigned and kept
long enough to produce meaningful occupancy values.

- Reduce impact of stealing/rotation of RMIDs: The new algorithm
tries to assign dirty RMIDs to their previous owners when
suitable, decreasing the error introduced by RMID rotation and
the negative impact of dirty RMIDs that drop occupancy too slowly
when unscheduled.

- Eliminate pmu::count: generic perf's perf_event_count()
performs a quick add of atomic types. The introduction of
pmu::count in the previous CMT series to read occupancy for thread
events changed the behavior of perf_event_count() by performing a
potentially slow IPI and write/read to MSR. It also made pmu::read
behave differently depending on whether the event was a
cpu/cgroup event or a thread event. This patch series removes the
custom pmu::count from CMT and provides consistent behavior for all
calls of perf_event_read.

- Add error return for pmu::read: Reads of CMT events may fail
due to stealing of RMIDs, even after successfully adding an event
to a PMU. This patch series expands pmu::read with an int return
value and propagates the error to callers that can fail
(i.e. perf_read).
The ability of pmu::read to fail is consistent with the recent
changes that allow perf_event_read to fail for transactional
reading of event groups.

- Introduce additional flags to perf_event::group_caps and
perf_event::event_caps: the flags PERF_EV_CAP_READ_ANY_{,CPU_}PKG
allow CMT events to be read while an event is inactive, saving
unnecessary IPIs. The flag PERF_EV_CAP_CGROUP_NO_RECURSION prevents
generic code from programming multiple CMT events in a CPU when
dealing with a cgroup hierarchy, since this is unsupported by hw.
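
As a rough illustration of the cmt_monitoring flag list described
above, the following user-space sketch parses a string such as
"1;3;0;1" into one flag per package. The helper names are hypothetical
and are not taken from the patches. The sketch only models the
"flag > 0 means monitored" rule stated in this letter and does not
interpret the other flag values.

```c
#include <assert.h>
#include <ctype.h>
#include <errno.h>
#include <stdlib.h>

/*
 * Hypothetical helper (not from the series): parse a semicolon-separated
 * flag list into flags[0..max_pkgs-1], one entry per CPU package.
 * Returns the number of entries parsed, or -1 on malformed input.
 */
static int cmt_parse_pkg_flags(const char *s, unsigned long *flags,
			       int max_pkgs)
{
	int n = 0;

	assert(s && flags);
	while (n < max_pkgs) {
		char *end;
		unsigned long v;

		/* Require a digit so signs and whitespace are rejected. */
		if (!isdigit((unsigned char)*s))
			return -1;
		errno = 0;
		v = strtoul(s, &end, 10);
		if (errno)
			return -1;	/* value out of range */
		flags[n++] = v;
		if (*end == '\0')
			return n;	/* consumed the whole list */
		if (*end != ';')
			return -1;	/* unexpected separator */
		s = end + 1;
	}
	return -1;			/* more entries than packages */
}

/* Per the cover letter, a package is monitored when its flag is > 0. */
static int cmt_pkg_monitored(const unsigned long *flags, int pkg)
{
	return flags[pkg] > 0;
}
```

With the input "1;3;0;1" and four packages, this marks packages 0, 1
and 3 as monitored and leaves package 2 unmonitored.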
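
The subtree read and the int-returning read described in the list above
can be pictured with the simplified user-space model below. The struct,
the function names, the INVALID_RMID marker and the choice of -ENODATA
are all assumptions made for illustration. The real driver reads
occupancy per package through MSRs, which this sketch replaces with a
stored value.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define INVALID_RMID (-1)	/* hypothetical: node currently holds no RMID */

/* Hypothetical stand-in for one monitored resource in the hierarchy. */
struct node {
	int rmid;		/* INVALID_RMID if the RMID was stolen */
	uint64_t occupancy;	/* stand-in for the hardware reading */
	struct node *child;	/* first child in the hierarchy */
	struct node *sibling;	/* next node with the same parent */
};

/* Read one node. Fails when the node has no valid RMID to read. */
static int node_read(const struct node *n, uint64_t *val)
{
	if (n->rmid == INVALID_RMID)
		return -ENODATA;
	*val = n->occupancy;
	return 0;
}

/*
 * Add the occupancy of n and of every node below it to *total.
 * The caller initializes *total. The first error is propagated
 * to the caller instead of being silently ignored.
 */
static int subtree_read(const struct node *n, uint64_t *total)
{
	const struct node *c;
	uint64_t val;
	int err;

	assert(n && total);
	err = node_read(n, &val);
	if (err)
		return err;
	*total += val;
	for (c = n->child; c; c = c->sibling) {
		err = subtree_read(c, total);
		if (err)
			return err;
	}
	return 0;
}
```

Returning an int lets a caller distinguish "occupancy is zero" from
"no value could be read", which a void read callback cannot express.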

This patch series also updates the perf tool to fix error handling and to
better handle the idiosyncrasies of snapshot and per-pkg events.

Support for Intel MBM is yet to be built on top of this driver.

Changes in 3rd version:
- Rename from CQM to CMT, making it consistent with Intel's latest docs.
- Plenty of fixes requested by Thomas G. Mainly:
- Redesign of pmonr state machine.
- Avoid abuse of lock_nested() by defining static lock_class_key's.
- Simplify locking rules.
- Remove unnecessary macros, inlines and wrappers.
- Remove reliance on WARN_ONs for error handling.
- Use kzalloc.
- Fix comments and line breaks.
- Add high level overview in comments of cmt header file.
- Cleaner device initialization/termination. Still not modular,
I am holding that change until the integration with perf cgroup
is discussed (it currently goes through architecture specific hooks,
see patch 36).
- Clean up and simplify RMID rotation code.
- Add user specific flags (uflags) for both events and
perf_cgroup.cmt_monitoring to allow No Rotation and No Lazy Allocation
of rmids.
- Use CPU Hotplug state machine.
- No longer need the new hook perf_event_exec to start monitoring after
an exec (hook introduced in v1, removed in this one).
- Remove polling of llc_occupancy for active rmids. Replaced by
asynchronous read (see patch 30).
- Change rmids pools to bitmaps, thus removing "wrapped rmid" (wrmid).
- Removal of per-package pools of wrmids used as temporary objects.
- Added a very useful debugfs node to observe internals such as:
  - monr hierarchy.
  - per-package data.
  - llc_occupancy of rmids.
- Reduction of code size to 66% of v2 (now 641 KBs).
- Rebased to tip x86/cache.
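
To make the "rmid pools as bitmaps" item in the list above concrete,
here is a minimal user-space sketch of a bitmap-backed pool. The pool
size, the struct and the function names are hypothetical. Kernel code
would more likely rely on the kernel's own bitmap helpers than on the
open-coded loops shown here.

```c
#include <assert.h>
#include <limits.h>
#include <string.h>

#define MAX_RMID   128	/* hypothetical pool size */
#define WORD_BITS  (sizeof(unsigned long) * CHAR_BIT)
#define POOL_WORDS ((MAX_RMID + WORD_BITS - 1) / WORD_BITS)

/* One bit per RMID: a set bit means the RMID is free. */
struct rmid_pool {
	unsigned long free_bits[POOL_WORDS];
};

/* Mark every RMID in [0, MAX_RMID) as free. */
static void pool_init(struct rmid_pool *p)
{
	unsigned int r;

	assert(p);
	memset(p->free_bits, 0, sizeof(p->free_bits));
	for (r = 0; r < MAX_RMID; r++)
		p->free_bits[r / WORD_BITS] |= 1UL << (r % WORD_BITS);
}

/* Return the lowest free RMID and mark it used, or -1 if none is free. */
static int pool_alloc(struct rmid_pool *p)
{
	unsigned int r;

	assert(p);
	for (r = 0; r < MAX_RMID; r++) {
		unsigned long mask = 1UL << (r % WORD_BITS);

		if (p->free_bits[r / WORD_BITS] & mask) {
			p->free_bits[r / WORD_BITS] &= ~mask;
			return (int)r;
		}
	}
	return -1;
}

/* Return an RMID to the pool. */
static void pool_free(struct rmid_pool *p, int rmid)
{
	assert(p && rmid >= 0 && rmid < MAX_RMID);
	p->free_bits[rmid / WORD_BITS] |= 1UL << (rmid % WORD_BITS);
}
```

Because an RMID's state is a single bit, no separate wrapper object
per RMID is needed to track whether it is free.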

Changes in 2nd version:
- As requested by Peter Z., redo commit history to completely remove
old version of CQM in a single patch.
- Use topology_max_packages and fix build errors reported by
Vikas Shivappa.
- Split largest patches, clean up.
- Rebased to peterz/queue perf/core.


David Carrillo-Cisneros (45):
perf/x86/intel/cqm: remove previous version of CQM and MBM
perf/x86/intel: rename CQM cpufeatures to CMT
x86/intel: add CONFIG_INTEL_RDT_M configuration flag
perf/x86/intel/cmt: add device initialization and CPU hotplug support
perf/x86/intel/cmt: add per-package locks
perf/x86/intel/cmt: add intel_cmt pmu
perf/core: add RDT Monitoring attributes to struct hw_perf_event
perf/x86/intel/cmt: add MONitored Resource (monr) initialization
perf/x86/intel/cmt: add basic monr hierarchy
perf/x86/intel/cmt: add Package MONitored Resource (pmonr)
initialization
perf/x86/intel/cmt: add cmt_user_flags (uflags) to monr
perf/x86/intel/cmt: add per-package rmid pools
perf/x86/intel/cmt: add pmonr's Off and Unused states
perf/x86/intel/cmt: add Active and Dep_{Idle, Dirty} states
perf/x86/intel: encapsulate rmid and closid updates in pqr cache
perf/x86/intel/cmt: set sched rmid and complete pmu start/stop/add/del
perf/x86/intel/cmt: add uflag CMT_UF_NOLAZY_RMID
perf/core: add arch_info field to struct perf_cgroup
perf/x86/intel/cmt: add support for cgroup events
perf/core: add pmu::event_terminate
perf/x86/intel/cmt: use newly introduced event_terminate
perf/x86/intel/cmt: sync cgroups and intel_cmt device start/stop
perf/core: hooks to add architecture specific features in perf_cgroup
perf/x86/intel/cmt: add perf_cgroup_arch_css_{online,offline}
perf/x86/intel/cmt: add monr->flags and CMT_MONR_ZOMBIE
sched: introduce the finish_arch_pre_lock_switch() scheduler hook
perf/x86/intel: add pqr cache flags and intel_pqr_ctx_switch
perf,perf/x86,perf/powerpc,perf/arm,perf/*: add int error return to
pmu::read
perf/x86/intel/cmt: add error handling to intel_cmt_event_read
perf/x86/intel/cmt: add asynchronous read for task events
perf/x86/intel/cmt: add subtree read for cgroup events
perf/core: Add PERF_EV_CAP_READ_ANY_{CPU_,}PKG flags
perf/x86/intel/cmt: use PERF_EV_CAP_READ_{,CPU_}PKG flags in Intel cmt
perf/core: introduce PERF_EV_CAP_CGROUP_NO_RECURSION
perf/x86/intel/cmt: use PERF_EV_CAP_CGROUP_NO_RECURSION in intel_cmt
perf/core: add perf_event cgroup hooks for subsystem attributes
perf/x86/intel/cmt: add cont_monitoring to perf cgroup
perf/x86/intel/cmt: introduce read SLOs for rotation
perf/x86/intel/cmt: add max_recycle_threshold sysfs attribute
perf/x86/intel/cmt: add rotation scheduled work
perf/x86/intel/cmt: add rotation minimum progress SLO
perf/x86/intel/cmt: add rmid stealing
perf/x86/intel/cmt: add CMT_UF_NOSTEAL_RMID flag
perf/x86/intel/cmt: add debugfs intel_cmt directory
perf/stat: revamp read error handling, snapshot and per_pkg events

Stephane Eranian (1):
perf/stat: fix bug in handling events in error state

arch/alpha/kernel/perf_event.c | 3 +-
arch/arc/kernel/perf_event.c | 3 +-
arch/arm64/include/asm/hw_breakpoint.h | 2 +-
arch/arm64/kernel/hw_breakpoint.c | 3 +-
arch/metag/kernel/perf/perf_event.c | 5 +-
arch/mips/kernel/perf_event_mipsxx.c | 3 +-
arch/powerpc/include/asm/hw_breakpoint.h | 2 +-
arch/powerpc/kernel/hw_breakpoint.c | 3 +-
arch/powerpc/perf/core-book3s.c | 11 +-
arch/powerpc/perf/core-fsl-emb.c | 5 +-
arch/powerpc/perf/hv-24x7.c | 5 +-
arch/powerpc/perf/hv-gpci.c | 3 +-
arch/s390/kernel/perf_cpum_cf.c | 5 +-
arch/s390/kernel/perf_cpum_sf.c | 3 +-
arch/sh/include/asm/hw_breakpoint.h | 2 +-
arch/sh/kernel/hw_breakpoint.c | 3 +-
arch/sparc/kernel/perf_event.c | 2 +-
arch/tile/kernel/perf_event.c | 3 +-
arch/x86/Kconfig | 12 +
arch/x86/events/amd/ibs.c | 2 +-
arch/x86/events/amd/iommu.c | 5 +-
arch/x86/events/amd/uncore.c | 3 +-
arch/x86/events/core.c | 3 +-
arch/x86/events/intel/Makefile | 3 +-
arch/x86/events/intel/bts.c | 3 +-
arch/x86/events/intel/cmt.c | 3498 ++++++++++++++++++++++++++++++
arch/x86/events/intel/cmt.h | 344 +++
arch/x86/events/intel/cqm.c | 1766 ---------------
arch/x86/events/intel/cstate.c | 3 +-
arch/x86/events/intel/pt.c | 3 +-
arch/x86/events/intel/rapl.c | 3 +-
arch/x86/events/intel/uncore.c | 3 +-
arch/x86/events/intel/uncore.h | 2 +-
arch/x86/events/msr.c | 3 +-
arch/x86/include/asm/cpufeatures.h | 14 +-
arch/x86/include/asm/hw_breakpoint.h | 2 +-
arch/x86/include/asm/intel_rdt_common.h | 62 +-
arch/x86/include/asm/perf_event.h | 29 +
arch/x86/include/asm/processor.h | 4 +
arch/x86/kernel/cpu/Makefile | 3 +-
arch/x86/kernel/cpu/common.c | 10 +-
arch/x86/kernel/cpu/intel_rdt_common.c | 37 +
arch/x86/kernel/hw_breakpoint.c | 3 +-
arch/x86/kvm/pmu.h | 10 +-
drivers/bus/arm-cci.c | 3 +-
drivers/bus/arm-ccn.c | 3 +-
drivers/perf/arm_pmu.c | 3 +-
include/linux/cpuhotplug.h | 4 +-
include/linux/perf_event.h | 70 +-
kernel/events/core.c | 177 +-
kernel/sched/core.c | 1 +
kernel/sched/sched.h | 3 +
kernel/trace/bpf_trace.c | 4 +-
tools/perf/builtin-stat.c | 42 +-
tools/perf/util/counts.h | 19 +
tools/perf/util/evsel.c | 49 +-
tools/perf/util/evsel.h | 8 +-
tools/perf/util/stat.c | 35 +-
58 files changed, 4361 insertions(+), 1956 deletions(-)
create mode 100644 arch/x86/events/intel/cmt.c
create mode 100644 arch/x86/events/intel/cmt.h
delete mode 100644 arch/x86/events/intel/cqm.c
create mode 100644 arch/x86/kernel/cpu/intel_rdt_common.c

--
2.8.0.rc3.226.g39d4020