Re: [PATCH 1/1] x86/cqm: Cqm requirements

From: Thomas Gleixner
Date: Fri Mar 10 2017 - 09:53:46 EST


On Thu, 9 Mar 2017, David Carrillo-Cisneros wrote:
> On Thu, Mar 9, 2017 at 3:01 AM, Thomas Gleixner <tglx@xxxxxxxxxxxxx> wrote:
> > On Wed, 8 Mar 2017, David Carrillo-Cisneros wrote:
> >> On Wed, Mar 8, 2017 at 12:30 AM, Thomas Gleixner <tglx@xxxxxxxxxxxxx> wrote:
> >> > Same applies for per CPU measurements.
> >>
> >> For CPU measurements, we need perf-like CPU filtering to support tools
> >> that perform low overhead monitoring by polling CPU events. These
> >> tools approximate per-cgroup/task events by reconciling CPU events
> >> with logs of which job ran when on which CPU.
> >
> > Sorry, but for CQM that's just voodoo analysis.
>
> I'll argue that. Yet, perf-like CPU is also needed for MBM, a less
> contentious scenario, I believe.

MBM is a different playground (albeit related due to the RMID stuff).

> It does not work well for a single run (your example). But for the
> example I gave, one can just rely on Random Sampling, Law of Large
> Numbers, and Central Limit Theorem.

Fine. So we need this for ONE particular use case. And if that is not well
documented, including the underlying mechanics to analyze the data, then
this will be a nice source of confusion for Joe User.

I still think that this can be done differently while keeping the overhead
small.

You look at this from the existing perf mechanics which require high
overhead context switching machinery. But that's just wrong because that's
not how the cache and bandwidth monitoring works.

Contrary to the other perf counters, CQM and MBM are based on a
context-selectable set of counters which do not require readout and
reconfiguration when the context switch happens.

Especially with CAT in play, the context switch overhead is there already
when CAT partitions need to be switched. So switching the RMID at the same
time is basically free, if we are smart enough to do an equivalent to the
CLOSID context switch mechanics and ideally combine both into a single MSR
write.
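
A minimal sketch of that combined switch, as a user-space simulation rather
than kernel code. It assumes the IA32_PQR_ASSOC layout (MSR 0xc8f, RMID in
the low bits, CLOSID in the upper 32 bits). The names pqr_state, wrmsr_stub,
pqr_encode and pqr_sched_in are hypothetical, and the stub only records the
value instead of executing a privileged MSR write:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* IA32_PQR_ASSOC: RMID in the low bits, CLOSID in the upper 32 bits. */
#define MSR_IA32_PQR_ASSOC 0x0c8fu
#define PQR_RMID_MASK      0x3ffu   /* assumed RMID field: bits 9:0 */

/* Hypothetical per-CPU cache of what was last written to the MSR. */
struct pqr_state {
	uint32_t rmid;
	uint32_t closid;
	bool     valid;       /* false until the first write on this CPU */
	unsigned writes;      /* number of simulated MSR writes */
	uint64_t last_value;  /* last value "written" to the MSR */
};

/* Stand-in for the privileged wrmsr instruction; it only records the value. */
static void wrmsr_stub(struct pqr_state *s, uint32_t msr, uint64_t val)
{
	(void)msr;
	s->last_value = val;
	s->writes++;
}

/* Pack RMID and CLOSID into one 64-bit MSR value. */
static uint64_t pqr_encode(uint32_t rmid, uint32_t closid)
{
	assert(rmid <= PQR_RMID_MASK);
	return ((uint64_t)closid << 32) | rmid;
}

/* Called on context switch with the incoming task's RMID and CLOSID. */
static void pqr_sched_in(struct pqr_state *s, uint32_t rmid, uint32_t closid)
{
	/* Nothing changed: skip the MSR write entirely. */
	if (s->valid && s->rmid == rmid && s->closid == closid)
		return;

	s->rmid = rmid;
	s->closid = closid;
	s->valid = true;

	/* One write updates both the monitoring and the allocation context. */
	wrmsr_stub(s, MSR_IA32_PQR_ASSOC, pqr_encode(rmid, closid));
}
```

With a zero-initialized pqr_state, switching to (rmid 5, closid 2) costs one
write, switching to the same pair again costs none, and changing only the
RMID costs one more.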

With that the low overhead periodic sampling can read N counters which are
related to the monitored set and provide N separate results. For bandwidth
the aggregation is a simple ADD and for cache residency it's pointless.
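
The sampling side could look roughly like the sketch below. It assumes each
monitored RMID exposes a monotonically increasing byte counter that is
already scaled to bytes. The names rmid_sample, read_counter_fn,
sample_bandwidth, fake_counters and fake_read are hypothetical, and the fake
backend exists only to make the example self-contained:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-RMID sampling state for one monitored set. */
struct rmid_sample {
	uint32_t rmid;
	uint64_t prev_bytes;   /* counter value at the previous sample */
	uint64_t delta_bytes;  /* traffic observed since the previous sample */
};

/* Hypothetical callback returning the current byte count for an RMID. */
typedef uint64_t (*read_counter_fn)(uint32_t rmid);

/*
 * Read the N counters of the monitored set, keep the N separate results
 * in delta_bytes, and return their sum. Summing is meaningful for
 * bandwidth only; cache occupancy values are not aggregated here.
 */
static uint64_t sample_bandwidth(struct rmid_sample *set, size_t n,
				 read_counter_fn read_counter)
{
	uint64_t total = 0;

	for (size_t i = 0; i < n; i++) {
		uint64_t now = read_counter(set[i].rmid);

		set[i].delta_bytes = now - set[i].prev_bytes;
		set[i].prev_bytes = now;
		total += set[i].delta_bytes;
	}
	return total;
}

/* Fake counter backend for illustration, indexed by RMID. */
static uint64_t fake_counters[4];

static uint64_t fake_read(uint32_t rmid)
{
	return fake_counters[rmid];
}
```

The per-RMID deltas stay available after each call, so a tool can report
either the individual results or the summed bandwidth of the set.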

Just because perf was designed with the regular performance counters in
mind (way before that CQM/MBM stuff came around) does not mean that we
cannot change/extend that if it makes sense.

And looking at the way Cache/Bandwidth allocation and monitoring works, it
makes a lot of sense. Definitely more than shoving it into the current modus
operandi with duct tape just because we can.

Thanks,

tglx