Re: [PATCH 00/12] Cqm2: Intel Cache quality monitoring fixes

From: David Carrillo-Cisneros
Date: Wed Feb 01 2017 - 18:12:34 EST


On Wed, Feb 1, 2017 at 12:08 PM Luck, Tony <tony.luck@xxxxxxxxx> wrote:
>
> > I was asking for requirements, not a design proposal. In order to make a
> > design you need a requirements specification.
>
> Here's what I came up with ... not a fully baked list, but should allow for some useful
> discussion on whether any of these are not really needed, or if there is a glaring hole
> that misses some use case:
>
> 1) Able to measure using all supported events (currently L3 occupancy, Total B/W, Local B/W)
> 2) Measure per thread
> 3) Including kernel threads
> 4) Put multiple threads into a single measurement group (forced by h/w shortage of RMIDs, but probably good to have anyway)

Even with infinite hw RMIDs you want to be able to have one RMID per
thread group, to avoid reading a potentially large list of RMIDs every
time you measure one group's event (with the delay and error
associated with reading many RMIDs whose values fluctuate rapidly).
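
To illustrate the error I mean, here is a toy model (not kernel code;
all names and numbers are made up). Two threads share 100 cache lines
in total, but the split between them changes between two successive
RMID reads:

```python
# Toy model of read skew when a group is measured by summing per-thread
# RMIDs. All names and numbers are illustrative, not kernel interfaces.

# per_thread_occupancy[t][rmid] = cache lines attributed to rmid at step t.
# The group total is 100 lines at both time steps.
per_thread_occupancy = [
    {"rmid_a": 60, "rmid_b": 40},  # time step 0
    {"rmid_a": 40, "rmid_b": 60},  # time step 1
]


def read_group_via_thread_rmids(samples, rmids):
    """Sum per-thread RMIDs, reading one RMID per time step.

    The reads are sequential, so each RMID is sampled at a different time.
    """
    return sum(samples[step][rmid] for step, rmid in enumerate(rmids))


def read_group_via_single_rmid(samples, step):
    """Model a single group-wide RMID: one read, one consistent total."""
    return sum(samples[step].values())


skewed = read_group_via_thread_rmids(per_thread_occupancy, ["rmid_a", "rmid_b"])
consistent = read_group_via_single_rmid(per_thread_occupancy, 1)
print(skewed, consistent)  # prints "120 100"
```

The summed per-thread reads report 120 lines for a group that never
held more than 100, and the skew grows with the number of RMIDs that
have to be read.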

> 5) New threads created inherit measurement group from parent
> 6) Report separate results per domain (L3)
> 7) Must be able to measure based on existing resctrl CAT group
> 8) Can get measurements for subsets of tasks in a CAT group (to find the guys hogging the resources)
> 9) Measure per logical CPU (pick active RMID in same precedence for task/cpu as CAT picks CLOSID)

I agree that "Measure per logical CPU" is a requirement, but why is
"pick active RMID in same precedence for task/cpu as CAT picks CLOSID"
one as well? Are we set on handling RMIDs the way CLOSIDs are
handled? There are drawbacks to doing so. One is that it would make it
impossible to do CPU monitoring and CPU filtering the way it is done
for all other PMUs.

i.e. the following commands (or their equivalent in whatever other API
you create) won't work:

a) perf stat -e intel_cqm/total_bytes/ -C 2

or

b.1) perf stat -e intel_cqm/total_bytes/ -C 2 <a_measurement_group>

or

b.2) perf stat -e intel_cqm/llc_occupancy/ -a <a_measurement_group>

(a) won't work because many RMIDs may run on the CPU, and (b.1) and
(b.2) won't work because the same measurement group's RMID will be
used across all CPUs. I know this is similar to how it is in CAT, but
CAT was never intended to do monitoring. We can do it the CAT way and
the perf way, or not, but if we drop perf-like CPU support, that must
be stated explicitly rather than be an implicit consequence of a
design choice leaked into the requirements.

> 10) Put multiple CPUs into a group


11) Able to measure across CAT groups. So that a user can:
A) measure a task that runs on CPUs that are in different CAT groups
(one of Thomas' use cases, FWICT), and
B) measure tasks even if they change their CAT group (my use case).

>
> Nice to have:
> 1) Readout using "perf(1)" [subset of modes that make sense ... tying monitoring to resctrl file system will make most command line usage of perf(1) close to impossible.


We discussed this offline and I still disagree that it is close to
impossible to use perf and perf_event_open. In fact, I think it's very
simple:

a) We stretch the usage of the pid parameter in perf_event_open to
also allow a PMU specific task group fd (as of now it's either a PID
or a cgroup fd).
b) PMUs that can handle non-cgroup task groups have a special PMU_CAP
flag to signal the generic code to not resolve the fd to a cgroup
pointer and, instead, save it as is in struct perf_event (a few lines
of code).
c) The PMU takes care of resolving the task group's fd.

The above is ONE way to do it; there may be others. But there is a big
advantage in leveraging perf_event_open: it eases integration with the
perf tool and the myriad tools that use the perf API.
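
A rough model of steps (a) to (c), written as plain Python rather than
kernel code. PERF_FLAG_PID_CGROUP is the existing perf_event_open flag;
PERF_PMU_CAP_TASK_GROUP_FD and resolve_target() are hypothetical names
for this sketch, and reusing PERF_FLAG_PID_CGROUP to mark "pid is an
fd" is an assumption, not something settled above:

```python
# Model of how generic perf code could interpret the pid argument.
PERF_FLAG_PID_CGROUP = 1 << 2        # existing uapi flag: pid is a cgroup fd
PERF_PMU_CAP_TASK_GROUP_FD = 1 << 0  # hypothetical capability bit, value arbitrary


def resolve_target(pid, flags, pmu_caps):
    """Return (kind, value) describing what the event gets attached to."""
    if not flags & PERF_FLAG_PID_CGROUP:
        # Today's behaviour: pid is a plain task id.
        return ("task", pid)
    if pmu_caps & PERF_PMU_CAP_TASK_GROUP_FD:
        # Steps (b) and (c): generic code keeps the fd as is and the PMU
        # resolves it to its own task group later.
        return ("pmu_group_fd", pid)
    # Today's behaviour: generic code resolves the fd to a cgroup.
    return ("cgroup", pid)


print(resolve_target(1234, 0, 0))
print(resolve_target(7, PERF_FLAG_PID_CGROUP, 0))
print(resolve_target(7, PERF_FLAG_PID_CGROUP, PERF_PMU_CAP_TASK_GROUP_FD))
```

Existing users are unaffected in this model: only a PMU that sets the
capability ever sees a raw fd.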

12) Whatever fs or syscall is provided instead of perf syscalls, it
should provide total_time_enabled in the way perf does; otherwise it
is hard to interpret MBM values.
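
A minimal sketch of why the enabled time matters, assuming snapshots of
(total_bytes, time_enabled in ns) like the ones perf returns with
PERF_FORMAT_TOTAL_TIME_ENABLED. The function name and the sample
numbers are made up:

```python
def mbm_bandwidth(prev, curr):
    """Bytes per second over the time the event was actually enabled.

    prev and curr are (total_bytes, time_enabled_ns) snapshots.
    Returns None if the event was not enabled between the snapshots.
    """
    delta_bytes = curr[0] - prev[0]
    delta_ns = curr[1] - prev[1]
    if delta_ns <= 0:
        return None
    return delta_bytes * 1_000_000_000 / delta_ns


# The snapshots are 2 s apart in wall-clock time, but the event was
# enabled (e.g. the group was scheduled in) for only 0.5 s of that.
prev = (0, 0)
curr = (1_000_000_000, 500_000_000)
print(mbm_bandwidth(prev, curr))  # prints 2000000000.0
```

Dividing the same byte count by the 2 s of wall-clock time would report
0.5 GB/s instead of 2 GB/s, which is the misreading I want to avoid.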

>
> -Tony
>
>