Re: [PATCH 00/12] Cqm2: Intel Cache quality monitoring fixes

From: David Carrillo-Cisneros
Date: Fri Jan 20 2017 - 02:58:07 EST


On Thu, Jan 19, 2017 at 6:32 PM, Vikas Shivappa
<vikas.shivappa@xxxxxxxxxxxxxxx> wrote:
> Resending to include Thomas, also with some changes. Sorry for the spam.
>
> Based on Thomas's and Peterz's feedback, I can think of two design
> variants which target:
>
> - Support monitoring and allocating using the same resctrl group.
> The user can use a resctrl group to allocate resources and also monitor
> them (with respect to tasks or cpus).
>
> - Also allow monitoring outside of resctrl so that the user can
> monitor subgroups which use the same closid. This mode can be used
> when the user wants to monitor more than just the resctrl groups.
>
> The first design version uses and modifies perf_cgroup; the second version
> builds a new interface, resmon.

The second version would require building a whole new set of tools,
deploying them and maintaining them. Users would have to run perf for certain
events and resmon (or whatever the new tool is named) for rdt. I see
it as too complex and much prefer to keep using perf.

> The first version is close to the patches
> sent with some additions/changes. This includes details of the design as
> per Thomas/Peterz feedback.
>
> 1> First Design option: without modifying the resctrl and using perf
> --------------------------------------------------------------------
> --------------------------------------------------------------------
>
> In this design everything in resctrl interface works like
> before (the info, resource group files like task schemata all remain the
> same)
>
>
> Monitor cqm using perf
> ----------------------
>
> perf can monitor individual tasks using the -t
> option just like before.
>
> # perf stat -e llc_occupancy -t PID1,PID2
>
> user can monitor the cpu occupancy using the -C option in perf:
>
> # perf stat -e llc_occupancy -C 5
>
> Below shows how user can monitor cgroup occupancy:
>
> # mount -t cgroup -o perf_event perf_event /sys/fs/cgroup/perf_event/
> # mkdir /sys/fs/cgroup/perf_event/g1
> # mkdir /sys/fs/cgroup/perf_event/g2
> # echo PID1 > /sys/fs/cgroup/perf_event/g2/tasks
>
> # perf stat -e intel_cqm/llc_occupancy/ -a -G g2
>
> To monitor a resctrl group, user can group the same tasks in resctrl
> group into the cgroup.
>
> To monitor the tasks in p1 in example 2 below, add the tasks in resctrl
> group p1 to cgroup g1
>
> # echo 5678 > /sys/fs/cgroup/perf_event/g1/tasks
>
> Introducing a new option for resctrl may complicate monitoring because
> supporting cgroup 'task groups' and resctrl 'task groups' leads to
> situations where:
> if the groups intersect, then there is no way to know what
> l3_allocations contribute to which group.
>
> ex:
> p1 has tasks t1, t2, t3
> g1 has tasks t2, t3, t4
>
> The only way to get occupancy for g1 and p1 would be to allocate an RMID
> for each task which can as well be done with the -t option.

That's simply recreating the resctrl group as a cgroup.

I think that the main advantage of doing allocation first is that we
could use the context switch in rdt allocation and greatly simplify
the pmu side of it.

If resctrl groups could lift the restriction of one resctrl group per CLOSID,
then the user could create many resctrl groups in the way perf cgroups are
created now. The advantage is that there won't be a cgroup hierarchy,
making things much simpler. Also, there is no need to optimize the perf event
context switch to make llc_occupancy work.

Then we only need a way to tell the perf_event_open syscall that
monitoring must happen in a resctrl group.

My first thought is to have an "rdt_monitor" file per resctrl group. A
user passes it to perf_event_open in the way cgroups are passed now.
We could extend the meaning of the flag PERF_FLAG_PID_CGROUP to also
cover rdt_monitor files. The syscall can figure out whether it's a cgroup or an
rdt_group. The rdt_monitoring PMU would only work with rdt_monitor
groups.
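
As a rough sketch of that idea, this is how a group fd is handed to
perf_event_open today for cgroups. The helper name open_group_event is
made up for illustration, and the "rdt_monitor" path is purely hypothetical
(no kernel provides it); the PMU type and config are left to the caller.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <linux/perf_event.h>
#include <stdint.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

/*
 * Open a perf event on a group, the way perf cgroup events are opened
 * today: the fd of the group is passed in the pid argument and
 * PERF_FLAG_PID_CGROUP tells the kernel to interpret it as a group.
 *
 * ASSUMPTION: under the proposal, group_path could also name a per-group
 * "rdt_monitor" file. That file is hypothetical.
 *
 * Returns the event fd, or -1 on failure.
 */
static int open_group_event(const char *group_path, uint32_t pmu_type,
			    uint64_t config, int cpu)
{
	struct perf_event_attr attr;
	int group_fd, event_fd;

	group_fd = open(group_path, O_RDONLY);
	if (group_fd < 0)
		return -1;

	memset(&attr, 0, sizeof(attr));
	attr.size = sizeof(attr);
	attr.type = pmu_type;	/* e.g. read from the PMU's sysfs "type" file */
	attr.config = config;

	/* Group mode needs cpu >= 0; group_fd takes the place of the pid. */
	event_fd = (int)syscall(SYS_perf_event_open, &attr, group_fd, cpu,
				-1, PERF_FLAG_PID_CGROUP);
	close(group_fd);	/* only needed for the duration of the call */
	return event_fd;
}
```

The point is that userspace would not change shape: only the kind of file
behind group_fd would differ.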

Then the rdt_monitoring PMU will be pretty dumb, having neither task
nor CPU contexts, just providing the pmu->read and pmu->event_init
functions.

Task monitoring can be done with resctrl as well by adding the PID to
a new resctrl group and opening the event on it. And, since we'd allow a CLOSID
to be shared between resctrl groups, allocation wouldn't break.

It's a first idea, so please don't hate too hard ;) .

David

>
> Monitoring cqm cgroups Implementation
> -------------------------------------
>
> When monitoring two different cgroups in the same hierarchy (e.g. g11
> has an ancestor g1 and both are being monitored, as shown below) we
> need the g11 counts to be considered for g1 as well.
>
> # mount -t cgroup -o perf_event perf_event /sys/fs/cgroup/perf_event/
> # mkdir /sys/fs/cgroup/perf_event/g1
> # mkdir /sys/fs/cgroup/perf_event/g1/g11
>
> When measuring llc_occupancy for g1 we cannot write two different RMIDs
> during context switch to measure the occupancy for both g1 and g11
> (because g1 needs to count for g11 as well).
> Hence the driver maintains this information and, during ctx switch,
> writes the RMID of the lowest member in the ancestry which is being
> monitored.
>
> The cqm_info is added to the perf_cgroup structure to maintain this
> information. The structure is allocated and destroyed at css_alloc and
> css_free. All the events tied to a cgroup can use the same
> information while reading the counts.
>
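> struct perf_cgroup {
> #ifdef CONFIG_INTEL_RDT_M
>     void *cqm_info;    /* points to a struct cqm_info */
> #endif
> ...
>
> };
>
> struct cqm_info {
>     bool mon_enabled;              /* is this cgroup itself monitored? */
>     int level;                     /* depth in the hierarchy, root = 0 */
>     u32 *rmid;                     /* RMID(s) used by this cgroup */
>     struct cqm_info *mfa;          /* nearest monitored ancestor */
>     struct list_head tskmon_rlist; /* monitored tasks in this cgroup */
> };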
>
> Due to the hierarchical nature of cgroups, every cgroup just
> monitors for the 'nearest monitored ancestor' at all times.
> Since the root cgroup is always monitored, all descendants
> at boot time monitor for root and hence every mfa points to root,
> except for root->mfa which is NULL.
>
> 1. RMID setup: When cgroup x starts monitoring:
> for each descendant y, if y's mfa->level < x->level, then
> y->mfa = x. (Where level of root node = 0...)
> 2. sched_in: During sched_in for x
> if (x->mon_enabled) choose x->rmid
> else choose x->mfa->rmid.
> 3. read: for each descendant y of cgroup x
> if (y->mon_enabled) count += rmid_read(y->rmid).
> 4. evt_destroy: for each descendant y of x, if (y->mfa == x)
> then y->mfa = x->mfa. Meaning if any descendant was monitoring for
> x, set that descendant to monitor for the cgroup which x was
> monitoring for.
>
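
To check my reading of the four steps above, here is a small userspace
sketch. Everything in it is a stand-in: struct cg, the fixed-size child
array and the fake occupancy field are hypothetical, and read is assumed
to include x itself along with its descendants.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_CHILDREN 4

/* Hypothetical userspace stand-in for the per-cgroup monitoring state. */
struct cg {
	bool mon_enabled;
	int level;			/* root = 0 */
	uint32_t rmid;
	uint64_t occupancy;		/* fake counter value for this RMID */
	struct cg *mfa;			/* nearest monitored ancestor */
	struct cg *child[MAX_CHILDREN];
	int nr_children;
};

/* Step 1 helper: descendants whose mfa sits above x now monitor for x. */
static void retarget(struct cg *y, struct cg *x)
{
	for (int i = 0; i < y->nr_children; i++) {
		struct cg *c = y->child[i];

		if (c->mfa->level < x->level)
			c->mfa = x;
		retarget(c, x);
	}
}

/* Step 1: cgroup x starts monitoring with the given RMID. */
static void start_monitoring(struct cg *x, uint32_t rmid)
{
	x->mon_enabled = true;
	x->rmid = rmid;
	retarget(x, x);
}

/* Step 2: RMID to write at sched_in for a task in cgroup x. */
static uint32_t sched_in_rmid(const struct cg *x)
{
	return x->mon_enabled ? x->rmid : x->mfa->rmid;
}

/* Step 3: sum the counts of x and every monitored descendant. */
static uint64_t read_count(const struct cg *x)
{
	uint64_t count = x->mon_enabled ? x->occupancy : 0;

	for (int i = 0; i < x->nr_children; i++)
		count += read_count(x->child[i]);
	return count;
}

/* Step 4 helper: descendants that monitored for x fall back to x->mfa. */
static void unlink_mfa(struct cg *y, struct cg *x)
{
	for (int i = 0; i < y->nr_children; i++) {
		struct cg *c = y->child[i];

		if (c->mfa == x)
			c->mfa = x->mfa;
		unlink_mfa(c, x);
	}
}

/* Step 4: the event on x is destroyed. */
static void stop_monitoring(struct cg *x)
{
	unlink_mfa(x, x);
	x->mon_enabled = false;
}
```

With root -> g1 -> g11, starting monitoring on g1 makes g11->mfa point to
g1, and destroying the g1 event makes it point back to root.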
> To monitor a task in a cgroup x along with monitoring cgroup x itself
> cqm_info maintains a list of tasks that are being monitored in the
> cgroup.
>
> When a task which belongs to a cgroup x is being monitored, it
> always uses its own task->rmid even if cgroup x is monitored during sched_in.
> To account for the counts of such tasks, cgroup keeps this list
> and parses it during read.
> tskmon_rlist is used to maintain the list. The list is modified when a
> task is attached to the cgroup or removed from the group.
>
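
If I follow, a read on the cgroup then adds the counts of its individually
monitored tasks to the cgroup's own count. A minimal sketch of that
accounting, with a plain singly linked list standing in for the kernel
list_head and fake occupancy values instead of RMID reads (both are
assumptions for illustration):

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical entry of the per-cgroup list of monitored tasks. */
struct mon_task {
	uint64_t occupancy;	/* fake reading for this task's own RMID */
	struct mon_task *next;	/* stand-in for the kernel list_head */
};

/* Count for the cgroup: its own RMID plus every monitored task's RMID. */
static uint64_t cgroup_read(uint64_t cgroup_occupancy,
			    const struct mon_task *tasks)
{
	uint64_t count = cgroup_occupancy;

	for (const struct mon_task *t = tasks; t; t = t->next)
		count += t->occupancy;
	return count;
}
```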
> Example 1 (Some examples modeled from resctrl ui documentation)
> ---------
>
> A single socket system which has real-time tasks running on core 4-7 and
> non real-time workload assigned to core 0-3. The real-time tasks share
> text and data, so a per task association is not required and due to
> interaction with the kernel it's desired that the kernel on these cores shares L3
> with the tasks.
>
> # cd /sys/fs/resctrl
>
> # echo "L3:0=3ff" > schemata
>
> core 0-1 are assigned to the new group and make sure that the
> kernel and the tasks running there get 50% of the cache.
>
> # echo 03 > p0/cpus
>
> monitor the cpus 0-1
>
> # perf stat -e llc_occupancy -C 0-1
>
> Example 2
> ---------
>
> A real time task running on cpu 2-3(socket 0) is allocated a dedicated 25% of the
> cache.
>
> # cd /sys/fs/resctrl
>
> # mkdir p1
> # echo "L3:0=0f00;1=ffff" > p1/schemata
> # echo 5678 > p1/tasks
> # taskset -cp 2-3 5678
>
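
A side note on the 25% figure: 0f00 sets 4 of the 16 bits that the ffff
mask on socket 1 suggests are available. Below is a small sketch of that
arithmetic. The helper names are hypothetical, and applying it to Example 1
(3ff giving 50%) assumes a 20-bit full mask, which the mail does not state.

```c
#include <stdint.h>

/* Count the bits set in v. */
static unsigned int bits_set(uint32_t v)
{
	unsigned int n = 0;

	for (; v; v &= v - 1)
		n++;
	return n;
}

/* Percentage of the cache selected by cbm, relative to full_mask. */
static unsigned int cbm_percent(uint32_t cbm, uint32_t full_mask)
{
	unsigned int total = bits_set(full_mask);

	return total ? (100u * bits_set(cbm & full_mask)) / total : 0;
}
```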
> To monitor the same group of tasks create a cgroup g1
>
> # mount -t cgroup -o perf_event perf_event /sys/fs/cgroup/perf_event/
> # mkdir /sys/fs/cgroup/perf_event/g1
> # perf stat -e llc_occupancy -a -G g1
>
> Example 3
> ---------
>
> Sometimes the user may just want to profile the cache occupancy first before
> assigning any CLOSids. This also provides an override option where the user
> can monitor some tasks which have, say, CLOS 0 and which are about to be placed
> in a CLOSid based on the amount of cache occupancy. This could apply to
> the same real-time tasks above where the user is calibrating the % of cache
> that's needed.
>
> # perf stat -e llc_occupancy -t PIDx,PIDy
>
> RMID allocation
> ---------------
>
> RMIDs are allocated per package to achieve better scaling of RMIDs.
> RMIDs are plentiful (2-4 per logical processor) and are also per package,
> meaning a two socket system would have twice the number of RMIDs.
> If we still run out of RMIDs, an error is returned saying that monitoring
> wasn't possible because no RMID was available.
>
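
A minimal sketch of what per-package allocation could look like, assuming a
small bitmask free list per package. The sizes, the names, and the choice
of -ENOSPC as the error are mine, not from the patches; RMID 0 is kept
reserved because it is described as the root cgroup's default.

```c
#include <errno.h>
#include <stdint.h>

#define RMIDS_PER_PKG 8		/* hypothetical; real h/w reports its own limit */

/* One free list per package, so packages never compete for RMIDs. */
struct pkg_rmids {
	uint32_t free_mask;	/* bit i set => RMID i is free */
};

static void pkg_rmids_init(struct pkg_rmids *p)
{
	/* RMID 0 stays reserved for the root/default group. */
	p->free_mask = ((1u << RMIDS_PER_PKG) - 1) & ~1u;
}

/* Returns the lowest free RMID (>= 1), or -ENOSPC if the package is out. */
static int rmid_alloc(struct pkg_rmids *p)
{
	for (int i = 1; i < RMIDS_PER_PKG; i++) {
		if (p->free_mask & (1u << i)) {
			p->free_mask &= ~(1u << i);
			return i;
		}
	}
	return -ENOSPC;
}

static void rmid_free(struct pkg_rmids *p, int rmid)
{
	if (rmid > 0 && rmid < RMIDS_PER_PKG)
		p->free_mask |= 1u << rmid;
}
```

Exhausting one package leaves the other untouched, which is the scaling
argument made above.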
> Kernel Scheduling
> -----------------
>
> During ctx switch cqm chooses the RMID in the following priority order:
>
> 1. if the cpu has an RMID, choose that
> 2. if the task has an RMID directly tied to it, choose that (task is
> monitored)
> 3. choose the RMID of the task's cgroup (by default tasks belong to root
> cgroup with RMID 0)
>
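
The priority order above boils down to something like the following. This
is only a sketch: pick_rmid and RMID_NONE are hypothetical names, and using
0 to mean "no dedicated RMID" is my assumption, based on RMID 0 being the
root default.

```c
#include <stdint.h>

#define RMID_NONE 0u	/* assumption: 0 means no dedicated RMID assigned */

/* Choose the RMID to program at context switch: cpu, then task, then cgroup. */
static uint32_t pick_rmid(uint32_t cpu_rmid, uint32_t task_rmid,
			  uint32_t cgroup_rmid)
{
	if (cpu_rmid != RMID_NONE)
		return cpu_rmid;
	if (task_rmid != RMID_NONE)
		return task_rmid;
	return cgroup_rmid;	/* falls back to root's RMID 0 by default */
}
```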
> Read
> ----
>
> When the user calls cqm to retrieve the monitored count, we read the
> counter_msr and return the count. For a cgroup hierarchy, the count is
> measured as explained in the cgroup implementation section by traversing
> the cgroup hierarchy.
>
>
> 2> Second Design option: Build a new usermode tool resmon
> ---------------------------------------------------------
> ---------------------------------------------------------
>
> In this design everything in resctrl interface works like
> before (the info, resource group files like task schemata all remain the
> same).
>
> This version supports monitoring resctrl groups directly.
> But we need a user interface for the user to read the counters. We could
> create one file in the resctrl directory to enable monitoring and one
> file which reflects the counts, but this may not be
> efficient since the user often reads the counts frequently.
>
> Build a new user mode interface resmon
> --------------------------------------
>
> Since modifying the existing perf to
> suit the different h/w architecture seems to not follow the CAT
> interface model, it may well be better to have a different and dedicated
> interface for the RDT monitoring (just like we had a new fs for CAT)
>
> resmon supports monitoring a resctrl group or a task. The two modes may
> provide enough granularity needed for monitoring:
> - can monitor cpu data.
> - can monitor per resctrl group data.
> - can choose a custom set or subset of tasks within a resctrl group and monitor them.
>
> # resmon [<options>]
> -r <resctrl group>
> -t <PID>
> -s <mon_mask>
> -I <time in ms>
>
> "resctrl group": is the resctrl directory.
>
> "mon_mask": is a bit mask of logical packages which indicates which packages the user is
> interested in monitoring.
>
> "time in ms": The time for which the monitoring takes place
> (this can potentially be changed to start and stop/read options)
>
> Example 1 (Some examples modeled from resctrl ui documentation)
> ---------
>
> A single socket system which has real-time tasks running on core 4-7 and
> non real-time workload assigned to core 0-3. The real-time tasks share
> text and data, so a per task association is not required and due to
> interaction with the kernel it's desired that the kernel on these cores shares L3
> with the tasks.
>
> # cd /sys/fs/resctrl
> # mkdir p0
> # echo "L3:0=3ff" > p0/schemata
>
> core 0-1 are assigned to the new group and make sure that the
> kernel and the tasks running there get 50% of the cache.
>
> # echo 03 > p0/cpus
>
> monitor the cpus 0-1 for 10s.
>
> # resmon -r p0 -s 1 -I 10000
>
> Example 2
> ---------
>
> A real time task running on cpu 2-3(socket 0) is allocated a dedicated 25% of the
> cache.
>
> # cd /sys/fs/resctrl
>
> # mkdir p1
> # echo "L3:0=0f00;1=ffff" > p1/schemata
> # echo 5678 > p1/tasks
> # taskset -cp 2-3 5678
>
> Monitor the task for 5s on socket zero
>
> # resmon -r p1 -s 1 -I 5000
>
> Example 3
> ---------
>
> Sometimes the user may just want to profile the cache occupancy first before
> assigning any CLOSids. This also provides an override option where the user
> can monitor some tasks which have, say, CLOS 0 and which are about to be placed
> in a CLOSid based on the amount of cache occupancy. This could apply to
> the same real-time tasks above where the user is calibrating the % of cache
> that's needed.
>
> # resmon -t PIDx,PIDy -s 1 -I 10000
>
> returns the sum of count of PIDx and PIDy
>
> RMID Allocation
> ---------------
>
> This would remain the same as in design version 1, where we support per
> package RMIDs and return an error when out of RMIDs due to h/w limited
> RMIDs.
>
>