Re: [PATCH 00/12] Cqm2: Intel Cache quality monitoring fixes

From: Vikas Shivappa
Date: Wed Jan 18 2017 - 21:23:55 EST


Based on feedback from Thomas and PeterZ, I can think of two design
variants which target:

-Support monitoring and allocating using the same resctrl group:
the user can use a resctrl group to allocate resources and also monitor
them (with respect to tasks or cpus).
-Allow 'task only' monitoring outside of resctrl. This mode can be used
when the user wants to override the RMIDs in resctrl or wants to
monitor more than just the resctrl groups.

Option 1> Without modifying resctrl

In this design everything in the resctrl interface works like
before (the info and resource group files like tasks and schemata all
remain the same), but each resctrl group is mapped to one RMID as well
as a CLOSID.

But we need a user interface for the user to read the counters. We
could create one file to start monitoring and one file in the resctrl
directory which reflects the counts, but that may not be efficient
since the user often reads the counts frequently.

For the user interface there may be two options to do this:

1.a> Build a new user mode interface, resmon

Since modifying the existing perf to suit the different h/w
architecture does not seem to follow the CAT interface model, it may
well be better to have a separate, dedicated interface for RDT
monitoring (just like we added a new fs for CAT).

$resmon -r <resctrl group> -s <mon_mask> -I <time in ms>

"resctrl group": is the resctrl directory.

"mon_mask: is a bit mask of logical packages which indicates which packages user is
interested in monitoring.

"time in ms": The time for which the monitoring takes place
(this can potentially be changed to start and stop/read options)
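To make the mon_mask semantics concrete, here is a small sketch (not
part of any proposed kernel code, just an illustration of the argument
format) of how a tool like resmon could decode "-s <mon_mask>" into a
list of logical package ids:

```python
# Illustration of the proposed "-s <mon_mask>" argument: each set bit
# selects one logical package to monitor.
def decode_mon_mask(mon_mask: int) -> list[int]:
    """Return the logical package ids selected by the bit mask."""
    packages = []
    pkg = 0
    while mon_mask:
        if mon_mask & 1:
            packages.append(pkg)
        mon_mask >>= 1
        pkg += 1
    return packages

# "-s 1" selects package 0 only; "-s 5" (0b101) selects packages 0 and 2.
print(decode_mon_mask(1))   # [0]
print(decode_mon_mask(5))   # [0, 2]
```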

Example 1 (some examples are modeled on the resctrl UI documentation)
---------

A single socket system has real-time tasks running on cores 4-7 and a
non-real-time workload assigned to cores 0-3. The real-time tasks share
text and data, so a per-task association is not required, and due to
interaction with the kernel it is desired that the kernel on these
cores shares L3 with the tasks.

# cd /sys/fs/resctrl
# mkdir p0
# echo "L3:0=3ff" > p0/schemata

Cores 0-1 are assigned to the new group, and the kernel and the tasks
running there get 50% of the cache.

# echo 03 > p0/cpus

Monitor cpus 0-1 for 5s:

# resmon -r p0 -s 1 -I 5000
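As a sanity check on the masks used in these examples (assuming a
20-bit capacity bit mask here and a 16-bit one in example 2 below; the
actual CBM length is reported by the h/w), the cache share a CBM grants
is just its popcount over the CBM length:

```python
# Illustration only: how a capacity bit mask (CBM) maps to a cache
# share, assuming a given CBM length reported by the hardware.
def cbm_share(cbm: int, cbm_len: int) -> float:
    """Fraction of the cache granted by this CBM."""
    return bin(cbm).count("1") / cbm_len

print(cbm_share(0x3ff, 20))   # 0.5  -> the 50% in example 1
print(cbm_share(0x0f00, 16))  # 0.25 -> the 25% in example 2
```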


Example 2
---------

A real-time task running on cpus 2-3 (socket 0) is allocated a
dedicated 25% of the cache.

# cd /sys/fs/resctrl

# mkdir p1
# echo "L3:0=0f00;1=ffff" > p1/schemata
# echo 5678 > p1/tasks
# taskset -cp 2-3 5678

Monitor the task for 5s on socket 0:

# resmon -r p1 -s 1 -I 5000

Example 3
---------

Sometimes the user may just want to profile the cache occupancy first,
before assigning any CLOSIDs. This also provides an override option
where the user can monitor some tasks which have, say, CLOS 0, before
placing them in a CLOSID based on the amount of cache occupancy. This
could apply to the same real-time tasks above, where the user is
calibrating the % of cache that is needed.

Monitor a task PIDx on socket 0 for 10s:

# resmon -t PIDx -s 1 -I 10000

1.b> Add a new option to perf, in addition to supporting task
monitoring in perf.

- Monitor a resctrl group.

Introduce a new option for perf "-R" which indicates to monitor a
resctrl group.

$mkdir /sys/fs/resctrl/p1
$echo PID1 > /sys/fs/resctrl/p1/tasks
$echo PID2 > /sys/fs/resctrl/p1/tasks

$perf stat -e llc_occupancy -R p1

would return the count for the resctrl group p1.

- Monitor a task outside of resctrl group ('task only')

In this case, perf can also monitor individual tasks using the -t
option just like before.

$perf stat -e llc_occupancy -t PID1

- Monitor CPUs.

For example 1 above, perf can be used to monitor the resctrl group
p0:

$perf stat -e llc_occupancy -R p0

The issue with both options may be what happens when we run out of
RMIDs. For resctrl groups, since we know the maximum number of groups
that can be created, and the # of CLOSIDs is much smaller than the # of
RMIDs, we reserve an RMID for each resctrl group, so there is never a
case where an RMID is not available for a resctrl group.
Task monitoring can use the rest of the RMIDs.

Why do we need separate 'task only' monitoring?
------------------------------------------------

The separate task monitoring option lets the user use the RMIDs
effectively and not be restricted to the # of CLOSIDs. It also deals
with scenarios like example 3.

RMID allocation/init
--------------------

resctrl monitoring:
RMIDs are allocated when CLOSIDs are allocated during mkdir. One RMID
is allocated per socket, just like a CLOSID.

task monitoring:
When task events are created, RMIDs are allocated. We could also do
lazy allocation of RMIDs when the tasks are actually scheduled in on a
socket.

Kernel Scheduling
-----------------

During context switch, cqm chooses the RMID with the following priority
(1 = highest priority):

1. If the cpu has an RMID, choose that.
2. If the task has an RMID directly tied to it, choose that.
3. Otherwise, choose the RMID of the task's resctrl group.
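The decision above boils down to a three-way fallback; a toy model (the
RMIDs here are plain optional ints, whereas the real code would pull
them from per-cpu state and task_struct):

```python
# Toy model of the sched-in RMID selection priority described above.
def choose_rmid(cpu_rmid, task_rmid, group_rmid):
    """Pick the RMID to program on context switch (priority 1 > 2 > 3)."""
    if cpu_rmid is not None:    # 1. the cpu has an RMID
        return cpu_rmid
    if task_rmid is not None:   # 2. the task has its own 'task only' RMID
        return task_rmid
    return group_rmid           # 3. fall back to the resctrl group's RMID

print(choose_rmid(None, 7, 3))  # 7: task RMID overrides the group RMID
```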

Read
----

When the user calls cqm to retrieve the monitored count, we read the
counter MSR and return the count.
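Roughly, the read path selects (RMID, event) in IA32_QM_EVTSEL, reads
the raw value from IA32_QM_CTR, and scales it by the conversion factor
the CPU reports via CPUID leaf 0xF. A sketch with the MSR access
stubbed out (in the kernel this would be rdmsrl()/wrmsrl(); the
`read_qm_ctr` callback here is a stand-in):

```python
# Sketch of the counter read path. The event id and scaling follow the
# IA32_QM_EVTSEL/IA32_QM_CTR scheme; the MSR accessor is a stand-in.
QM_EVTSEL_EVENT_OCCUPANCY = 1  # llc_occupancy event id

def read_occupancy(rmid, scale, read_qm_ctr):
    """Return occupancy in bytes for one RMID on the current package."""
    raw = read_qm_ctr(rmid, QM_EVTSEL_EVENT_OCCUPANCY)
    return raw * scale

# e.g. a raw count of 100 units with a 64-byte conversion factor:
print(read_occupancy(4, 64, lambda rmid, ev: 100))  # 6400
```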


Option 2> Modifying resctrl

This changes the resctrl schemata interface so that the user inputs
CLOSIDs and RMIDs instead of CBMs.

# cd /sys/fs/resctrl
# mkdir p0 p1
# echo "L3:0=<closidx>;1=<closidy>" > /sys/fs/resctrl/p0/schemata

There is a mapping between closid and cbm which the user can change.

# echo 0xff > .../config/L3/0/cbm

Display the CLOSIDs:

# ls .../config/L3/
0
1
2
.
.
.
15

As an extension for cqm, this schemata can be modified to also let the
user choose the RMIDs. That way the user can configure different RMIDs
for the same CLOSID if needed, as in example 3. Also, since there are
many more RMIDs than CLOSIDs, the user is not restricted by the number
of resctrl groups that can be created (with the current model, the user
cannot create more directories than the number of CLOSIDs).

# echo "L3:0=<closidx>,<RMID1>;1=<closidy>,<RMID2>" > /sys/fs/resctrl/p0/schemata

The user interface to monitor can be the same as shown in design
variant #1, with the difference that there may be less need for the
'task only' monitoring.