Re: [PATCH v5 00/29] x86/resctrl telemetry monitoring

From: James Morse
Date: Fri Jun 13 2025 - 12:57:42 EST

Next message: James Morse: "Re: [PATCH] fs/resctrl: Optimize code in rdt_get_tree()"
Previous message: Ira Weiny: "Re: [PATCH v9 4/7] ACPI: APEI: EINJ: Discover EINJv2 parameters"
Next in thread: Luck, Tony: "Re: [PATCH v5 00/29] x86/resctrl telemetry monitoring"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

Hi Tony,

I'm still going through this, but here is my attempt to describe what equivalents arm has
in this area.

On 21/05/2025 23:50, Tony Luck wrote:
> Background
> ----------
>
> Telemetry features are being implemented in conjunction with the
> IA32_PQR_ASSOC.RMID value on each logical CPU. This is used to send
> counts for various events to a collector in a nearby OOBMSM device to be
> accumulated with counts for each <RMID, event> pair received from other
> CPUs. Cores send event counts when the RMID value changes, or after each
> 2ms elapsed time.

This is a shared memory area where an external agent (the OOBMSM) has logged measurement data?

Arm's equivalent to this is two things.
For things close to the CPU (e.g. stalls_llc_miss) these would be an PMU (possibly uncore
PMU) which follow a convention on their register layout meaning the general purpose pmu
driver should be able to drive them. The meaning of the events is described to user-space
via the perf json file. The kernel knows how to read event:6, but not what event:6 means.
The spec for this mentions MPAM, values can be monitored by ~RMID, but none of this is
managed by the MPAM driver.

The other thing arm has that is a bit like this is SCMI, which is a packet format for
talking to an on-die microcontroller to get platform specific temperature, voltage and
clock values. Again, this is another bit of kernel infrastructure that has its own way of
doing things. I don't see this filtering things by ~RMID ... but I guess its possible.
That can have shared memory areas (termed 'fast channels'). I think they are an array of
counter values, and something in the packet stream tells you which one is which.

Neither of these need picking up by the MPAM driver to expose via resctrl. But I'd like to
get that information across where possible so that user-space can be portable.

> Each OOBMSM device may implement multiple event collectors with each
> servicing a subset of the logical CPUs on a package. In the initial
> hardware implementation, there are two categories of events: energy
> and perf.
>
> 1) Energy - Two counters
> core_energy: This is an estimate of Joules consumed by each core. It is
> calculated based on the types of instructions executed, not from a power
> meter. This counter is useful to understand how much energy a workload
> is consuming.
>
> activity: This measures "accumulated dynamic capacitance". Users who
> want to optimize energy consumption for a workload may use this rather
> than core_energy because it provides consistent results independent of
> any frequency or voltage changes that may occur during the runtime of
> the application (e.g. entry/exit from turbo mode).

> 2) Performance - Seven counters
> These are similar events to those available via the Linux "perf" tool,
> but collected in a way with much lower overhead (no need to collect data
> on every context switch).
>
> stalls_llc_hit - Counts the total number of unhalted core clock cycles
> when the core is stalled due to a demand load miss which hit in the LLC
>
> c1_res - Counts the total C1 residency across all cores. The underlying
> counter increments on 100MHz clock ticks
>
> unhalted_core_cycles - Counts the total number of unhalted core clock
> cycles
>
> stalls_llc_miss - Counts the total number of unhalted core clock cycles
> when the core is stalled due to a demand load miss which missed all the
> local caches
>
> c6_res - Counts the total C6 residency. The underlying counter increments
> on crystal clock (25MHz) ticks
>
> unhalted_ref_cycles - Counts the total number of unhalted reference clock
> (TSC) cycles
>
> uops_retired - Counts the total number of uops retired
>
> The counters are arranged in groups in MMIO space of the OOBMSM device.
> E.g. for the energy counters the layout is:
>
> Offset: Counter
> 0x00 core energy for RMID 0
> 0x08 core activity for RMID 0
> 0x10 core energy for RMID 1
> 0x18 core activity for RMID 1
> ...

For the performance counters especially, on arm I'd be trying to get these values by
teaching perf about the CLOSID/RMID values, so that perf events are only incremented for
tasks in a particular control/monitor group.
(why that might be relevant is below)

> Resctrl User Interface
> ----------------------
>
> Because there may be multiple OOBMSM collection agents per processor
> package, resctrl accumulates event counts from all agents on a package
> and presents a single value to users. This will provide a consistent
> user interface on future platforms that vary the number of collectors,
> or the mappings from logical CPUs to collectors.

Great!

> Users will continue to see the legacy monitoring files in the "L3"
> directories and the telemetry files in the new "PERF_PKG" directories
> (with each file providing the aggregated value from all OOBMSM collectors
> on that package).
>
> $ tree /sys/fs/resctrl/mon_data/
> /sys/fs/resctrl/mon_data/
> ├── mon_L3_00
> │ ├── llc_occupancy
> │ ├── mbm_local_bytes
> │ └── mbm_total_bytes
> ├── mon_L3_01
> │ ├── llc_occupancy
> │ ├── mbm_local_bytes
> │ └── mbm_total_bytes

> ├── mon_PERF_PKG_00

Where do the package ids come from? How can user-space find out which CPUs are in package-0?

I don't see a package_id in either /sys/devices/system/cpu/cpu0/topology or
Documentation/ABI/stable/sysfs-devices-system-cpu.

> │ ├── activity
> │ ├── c1_res
> │ ├── c6_res
> │ ├── core_energy
> │ ├── stalls_llc_hit
> │ ├── stalls_llc_miss
> │ ├── unhalted_core_cycles
> │ ├── unhalted_ref_cycles
> │ └── uops_retired
> └── mon_PERF_PKG_01
> ├── activity
> ├── c1_res
> ├── c6_res
> ├── core_energy
> ├── stalls_llc_hit
> ├── stalls_llc_miss
> ├── unhalted_core_cycles
> ├── unhalted_ref_cycles
> └── uops_retired

Looks good to me.

The difficulty MPAM platforms have had with mbm_total_bytes et al is the "starts counting
from the beginning of time" property. Having to enable mbm_total_bytes before it counts
would have allowed MPAM to report an error if it couldn't enable more than N counters at a
time. (ABMC suggests AMD platforms have a similar problem).

How do you feel about having to enable these before they start counting?

This would allow the MPAM driver to open the event via perf if it has a corresponding
feature/counter, then provide the value from perf via resctrl.

Another headache is how we describe the format of the contents of these files... a made
up example: residency counts could be in absolute time, or percentages. I've been bitten
by the existing schemata strings being implicitly in a particular format, meaning
conversions have to happen. I'm not sure whether some architecture/platform would trip
over the same problem here.

> Resctrl Implementation
> ----------------------
>
> The OOBMSM driver exposes "intel_pmt_get_regions_by_feature()"
> that returns an array of structures describing the per-RMID groups it
> found from the VSEC enumeration. Linux looks at the unique identifiers
> for each group and enables resctrl for all groups with known unique
> identifiers.
>
> The memory map for the counters for each <RMID, event> pair is described
> by the XML file. This is too unwieldy to use in the Linux kernel, so a
> simplified representation is built into the resctrl code.

(I hope there are only a few combinations!)

> Note that the
> counters are in MMIO space instead of accessed using the IA32_QM_EVTSEL
> and IA32_QM_CTR MSRs. This means there is no need for cross-processor
> calls to read counters from a CPU in a specific domain.

Huzzah! RISC-V has this property, and many MPAM platforms do, (...but not all...)

> The counters can be read from any CPU.
>
> High level description of code changes:
>
> 1) New scope RESCTRL_PACKAGE
> 2) New struct rdt_resource RDT_RESOURCE_PERF_PKG
> 3) Refactor monitor code paths to split existing L3 paths from new ones. In some cases this ends up with:
> switch (r->rid) {
> case RDT_RESOURCE_L3:
> helper for L3
> break;
> case RDT_RESOURCE_PERF_PKG:
> helper for PKG
> break;
> }
> 4) New source code file "intel_aet.c" for the code to enumerate, configure, and report event counts.
>
> With only one platform providing this feature, it's tricky to tell
> exactly where it is going to go. I've made the event definitions
> platform specific (based on the unique ID from the VSEC enumeration). It
> seems possible/likely that the list of events may change from generation
> to generation.

My thinking about this from a perf angle was to have named events for those things that
resctrl supports, but allow events to be specified by number, and funnel those through
resctrl_arch_rmid_read() so that the arch code can interpret them as a counter type-id or
an offset in some array. The idea was to allow platform specific counters to be read
without any kernel changes, reducing the pressure to add resctrl support for counters that
may only ever be present in a single platform.

With the XML data you have, would it be possible to add new 'events' to this interface via
sysfs/configfs? Or does too much depend on the data identified by that GUID...

Thanks,

James

Next message: James Morse: "Re: [PATCH] fs/resctrl: Optimize code in rdt_get_tree()"
Previous message: Ira Weiny: "Re: [PATCH v9 4/7] ACPI: APEI: EINJ: Discover EINJv2 parameters"
Next in thread: Luck, Tony: "Re: [PATCH v5 00/29] x86/resctrl telemetry monitoring"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]