Re: [PATCH 5/9] x86/intel_rdt: Add new cgroup and Class of service management

From: Marcelo Tosatti
Date: Fri Aug 07 2015 - 09:15:39 EST


On Thu, Aug 06, 2015 at 01:46:06PM -0700, Vikas Shivappa wrote:
>
>
> On Wed, 5 Aug 2015, Marcelo Tosatti wrote:
>
> >On Wed, Aug 05, 2015 at 01:22:57PM +0100, Matt Fleming wrote:
> >>On Sun, 02 Aug, at 12:31:57PM, Tejun Heo wrote:
> >>>
> >>>But we're doing it the wrong way around. You can do most of what
> >>>cgroup interface can do with systemcall-like interface with some
> >>>inconvenience. The other way doesn't really work. As I wrote in the
> >>>other reply, cgroups is a horrible programmable interface and we don't
> >>>want individual applications to interact with it directly and CAT's
> >>>use cases most definitely include each application programming its own
> >>>cache mask.
> >>
> >>I wager that this assertion is wrong. Having individual applications
> >>program their own cache mask is not going to be the most common
> >>scenario.
> >
> >What i like about the syscall interface is that it moves the knowledge
> >of cache behaviour close to the application launching (or inside it),
> >which allows the following common scenario, say on a multi purpose
> >desktop:
> >
> >Event: launch high performance application: use cache reservation, finish
> >quickly.
> >Event: cache hog application: do not thrash the cache.
> >
> >The two cache reservations are logically unrelated in terms of
> >configuration, and configured separately do not affect each other.
>
> There could be several issues with letting apps allocate the cache
> themselves. We just cannot treat cache alloc like memory
> allocation; please consider the scenarios below:
>
> All examples assume a cache size of 10MB and a max CBM of 10 bits.
>
>
> (1) user-programmable syscall:
>
> 1.1> Exclusive access: The task cannot give *itself* exclusive
> access to the cache. For this it needs to have visibility of
> the cache allocation of other tasks and may need to reclaim or
> override others' cache allocs, which is not feasible (isn't that the
> job of a system managing agent?).

Different allocation of the resource (cache in this case) causes
different cache miss patterns and therefore different results.

> eg:
> apps 1..10 ask for 1MB of exclusive cache each.
> They get it, as there was 10MB.
>
> But now a large portion of tasks on the system will end up without any cache? -
> this is not possible.
> Or do they share a common pool or a default shared pool? - if there is such a
> default pool then that needs to be *managed*, and this reduces the
> number of exclusive cache allocations that can be given.

The proposal would be for the administrator to set up how much each user
can reserve via ulimit (per-user).
To change that per-user configuration, it would be necessary to
stop the tasks.

However, that makes no sense; revoking crossed my mind as well.
To allow revoking, it would be necessary to have a special capability
(which only root has by default).

The point here is that it should be possible to modify cache
reservations.

Alternatively, use a priority system. So:

Revoking:
--------
Privileged systemcall to list and invalidate cache reservations.
Assumes that reservations returned by "sys_cache_reservation"
are persistent and that users of the "remove" system call
are aware of the consequences.

Priority:
---------
Use some priority order (based on nice value, or a new separate
value used for comparison) to decide which reservations take
precedence.

*I-1* (todo notes)
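
To make that concrete, here is a rough userspace-side sketch of what
such an interface could look like. Every syscall name, syscall number
and the struct layout below are invented purely for illustration;
nothing of this exists today:

/* Sketch only: none of these syscalls exist; the numbers and the struct
 * layout are invented just to illustrate the revoke/priority idea above. */
#include <unistd.h>
#include <sys/syscall.h>

#define __NR_cache_res_create  1000   /* hypothetical syscall number */
#define __NR_cache_res_list    1001   /* hypothetical: privileged listing */
#define __NR_cache_res_remove  1002   /* hypothetical: privileged revoke */

struct cache_reservation {
	unsigned long kbytes;   /* rounded by the kernel to cache-way granularity */
	int type;               /* data/code/unified, relevant once CDP is enabled */
	int flags;              /* e.g. exclusive */
	int priority;           /* decides which reservations lose on conflict */
};

int main(void)
{
	struct cache_reservation res = { .kbytes = 1024, .priority = 0 };

	/* Task creates a reservation for itself; would fail if the
	 * administrator-configured per-user limit were exceeded. */
	long id = syscall(__NR_cache_res_create, &res);

	/* A privileged manager (CAP_SYS_ADMIN) could list all reservations
	 * and invalidate one; holders must tolerate revocation. */
	if (id >= 0)
		syscall(__NR_cache_res_remove, id);

	return 0;
}

The important property is that listing and revoking are separate,
privileged operations, so unprivileged tasks only ever manage their
own reservations.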


> 1.2> Noisy neighbour problem: how does the task itself decide it's the noisy
> neighbour? This is the
> key requirement the feature wants to address. We want to address the
> jitter and inconsistencies in the quality of service the apps get, things
> like response times. If you read the SDM it's mentioned
> clearly there as well. Can the task voluntarily declare itself a
> noisy neighbour (how??) and relinquish the cache allocation (how
> much?)? But that's not even guaranteed.

I suppose this requires global information (how much cache each
application is using) and a goal: what is the end result that
a particular cache resource division should achieve.

Each cache division has an outcome: certain instruction sequences
execute faster than others.

Whether a given task is a "cache hog" (that is, evicting cachelines
of other tasks does not reduce execution time of the "cache hog" task
itself, and therefore does not benefit the performance of the system
as a whole) is probably not an ideal characterization: each task has
different subparts that could be considered "cache hogs", and parts
that are not "cache hogs".

I think that for now, handling the static usecases is good enough.

> How can we expect every application coder to know what system the
> app is going to run on and what the optimal amount of cache for the
> app is - it's not like memory allocation, for #3 and #4 below.

"Optimal" depends on what the desired end result is: execution time as
a whole, execution time of an individual task, etc.

In case the applications are not aware of the cache, the OS should
divide the resource automatically using heuristics (in analogy with LRU).

For special applications, the programmer/compiler can find the optimal
tuning.

> 1.3> cannot treat cache allocation like memory allocation.
> There are syscall alternatives to do memory allocation apart from cgroups
> like cpuset, but we cannot treat the two as the same.
> (This is with reference to the point that there are alternatives to memory
> allocation apart from using cpuset, but the whole point is you can't
> treat memory allocation and cache allocation as the same.)
> 1.3.1> Memory is a very large pool in terms of GBs and we are talking
> about only a few MBs here (~10 - 20 MB, orders of magnitude smaller). So
> this could easily get into the situation mentioned
> above where the first few apps get all the exclusive cache and the rest have to
> starve.

Point. Applications are allowed to set their own cache reservations because
it's convenient: it's easier to consider and set up the cache allocation of
a given application than to have to consider and set up the whole
system.

If setting reservations individually conflicts or affects the system as
a whole, then the administrator or decision logic should resolve the
situation.

> 1.3.2> Memory is virtualized: each process has its own space and we are
> not even bound by the physical memory capacity, as we can virtualize
> it, so an app can indeed ask for more memory than the physical memory,
> along with other apps doing the same - but we can't do the same here
> with cache allocation. Even if we evict the cache, that defeats the
> purpose of cache allocation to threads.

ulimit.

*I-2*

> 1.4> Specific h/w requirements: with code/data prioritization (cdp), the h/w
> requires the OS to reset all the capacity bitmasks once we change mode
> from cdp to legacy cache alloc. So
> naturally we need to remove the tasks along with all their allocations. We cannot
> easily take away all the cache allocations that users think are theirs,
> having allocated them using the syscall. This is as if a task's
> malloc succeeds and midway its allocation is no longer there.
> Also this adds to the argument that you need to treat cache allocation and
> other resource allocation like memory differently.

Point.

*I-3*
CDP -> CAT transition.
CAT -> CDP transition.

>
> 1.5> In cloud and container environments, say we need to
> allocate cache for an entire VM which runs a specific real_time
> workload vs. allocate cache for VMs which run, say, a noisy_workload -
> how can we achieve this by letting each app decide how much cache
> it needs to be allocated? This is best done by an external system
> manager.

Agreed. This is what will happen in that use-case, and the systemcall
interface allows it.

>
> (2)cgroup interface:
>
> (2.1) compare above usage
>
> 1.1> and 1.2> above can easily be done with the cgroup interface.
> The key difference is between system management and process self-management of the cache
> allocation. When there is a centralized system manager this works fine.
>
> The administrator can
> make sure that certain tasks/groups of tasks get exclusive cache blocks. And the
> administrator can determine the noisy-neighbour application or workload using
> cache monitoring and make allocations appropriately.
>
> A classic use case is here :
> http://www.intel.com/content/www/us/en/communications/cache-allocation-technology-white-paper.html
>
> $ cd /sys/fs/cgroup/rdt
> $ cd group1
> $ /bin/echo 0xf > intel_rdt.l3_cbm
>
> $ cd group2
> $ /bin/echo 0xf0 > intel_rdt.l3_cbm
>
> If we want to prevent the system admin from accidentally allocating
> overlapping masks, this could easily be extended with an
> always-exclusive flag.
>
> Rounding off: we can easily write a batch file to calculate the
> chunk size, show it, and then allocate based on byte size. This is
> something that can easily be done on top of this interface.

Agree byte specification can be done in cgroups.
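
For illustration, the rounding could be as simple as the sketch below,
using the 10MB cache / 10-bit CBM example from earlier in this thread.
The arithmetic and the resulting echo command are mine, not part of the
patches:

/* Userspace helper sketch: convert a byte request into a contiguous CBM.
 * The cgroup interface itself only takes the bitmask. */
#include <stdio.h>

int main(void)
{
	unsigned long cache_bytes = 10 * 1024 * 1024;  /* total L3, example value */
	unsigned int  cbm_bits    = 10;                /* max CBM bits, example value */
	unsigned long chunk       = cache_bytes / cbm_bits;    /* bytes per CBM bit */

	unsigned long request = 3 * 1024 * 1024;       /* app wants 3MB */
	unsigned int  nbits   = (request + chunk - 1) / chunk; /* round up */
	unsigned long cbm     = (1UL << nbits) - 1;    /* contiguous mask from bit 0 */

	printf("echo 0x%lx > /sys/fs/cgroup/rdt/group1/intel_rdt.l3_cbm\n", cbm);
	return 0;
}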

> Assign tasks to group2
>
> $ /bin/echo PID1 > tasks
> $ /bin/echo PID2 > tasks
>
> If a bunch of threads belonging to a process (Processidx) need to be allocated
> cache -
> $ /bin/echo <Processidx> > cgroup.procs
>
>
> The 1.4> above can possibly be addressed in cgroups but would need some support
> which we are planning to send. One way to address this is to tear down
> the subsystem by deleting all the existing cgroup directories and then handling
> the reset. So cdp starts fresh with all bitmasks ready to be allocated.

Agree this is a very good point. The syscall interface must handle it.

*I-4*

> (2.2) cpu affinity:
>
> Similarly, the rdt cgroup can be used to assign affinity to the entire cgroup itself.
> You could always use taskset as well!
>
> example2: The commands below allocate '1MB of L3 cache on socket1 to group1'
> and '2MB of L3 cache on socket2 to group2'.
> This mounts both cpuset and intel_rdt, hence ls would list the
> files of both subsystems.
> $ mount -t cgroup -ocpuset,intel_rdt cpuset,intel_rdt rdt/
> $ ls /sys/fs/cgroup/rdt
> cpuset.cpus
> cpuset.mem
> ...
> intel_rdt.l3_cbm
> tasks
>
> Assign the cache
> $ /bin/echo 0xf > /sys/fs/cgroup/rdt/group1/intel_rdt.l3_cbm
> $ /bin/echo 0xff > /sys/fs/cgroup/rdt/group2/intel_rdt.l3_cbm
>
> Assign tasks for group1 and group2
> $ /bin/echo PID1 > /sys/fs/cgroup/rdt/group1/tasks
> $ /bin/echo PID2 > /sys/fs/cgroup/rdt/group1/tasks
> $ /bin/echo PID3 > /sys/fs/cgroup/rdt/group2/tasks
> $ /bin/echo PID4 > /sys/fs/cgroup/rdt/group2/tasks
>
> Tie the group1 to socket1 and group2 to socket2
> $ /bin/echo <cpumask for socket1> > /sys/fs/cgroup/rdt/group1/cpuset.cpus
> $ /bin/echo <cpumask for socket2> > /sys/fs/cgroup/rdt/group2/cpuset.cpus
>
> >
> >They should be configured separately.
> >
> >Also, data/code reservation is specific to the application, so
> >its specification should be close to the application (it's just
> >cumbersome to maintain that data somewhere else).
> >
> >>Only in very specific situations would you trust an
> >>application to do that.
> >
> >Perhaps ulimit can be used to allow a certain limit on applications.
>
> The ulimit is very subjective and depends on the workloads/amount of
> cache space available/total cache etc. - see, here you are moving
> towards a controlling agent which could possibly configure ulimit to
> control what apps get.

The point of ulimit is to let unprivileged users use cache
reservations as well. So, for example, one configuration would be:

HW: 32MB L3 cache.

user      maximum cache reservation
root      32MB
user-A    1MB
user-B    1MB
user-C    1MB
...

But you'd probably want to say "no more than 2MB for
non-root". I don't think ulimit can handle that.
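
To make the per-user accounting concrete, the check inside a
(hypothetical) reservation syscall could look roughly like this; the
struct and function below are invented for illustration only:

/* Sketch of per-user accounting a hypothetical cache reservation
 * syscall could perform; nothing here exists in the kernel today. */
#include <stdbool.h>
#include <stdio.h>

struct user_cache_acct {
	unsigned long limit_kb;     /* set by the administrator, e.g. 1024 = 1MB */
	unsigned long reserved_kb;  /* sum of this user's current reservations */
};

/* Grant only while the user stays within the administrator-configured
 * limit; root (or a privileged manager) would bypass this or revoke. */
static bool cache_res_allowed(struct user_cache_acct *acct, unsigned long req_kb)
{
	if (acct->reserved_kb + req_kb > acct->limit_kb)
		return false;
	acct->reserved_kb += req_kb;
	return true;
}

int main(void)
{
	struct user_cache_acct user_a = { .limit_kb = 1024 };  /* 1MB limit */

	printf("first 1MB: %d\n", cache_res_allowed(&user_a, 1024));  /* granted */
	printf("second 1MB: %d\n", cache_res_allowed(&user_a, 1024)); /* denied */
	return 0;
}

An aggregate "no more than 2MB for all non-root users" rule would need
system-wide accounting on top of this, which is indeed beyond what a
per-user ulimit-style limit provides.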

> >>A much more likely use case is having the sysadmin carve up the cache
> >>for a workload which may include multiple, uncooperating applications.
> >
> >Sorry, what cooperating means in this context?
>
> See example 1.2 above - a noisy neighbour can't be expected to
> relinquish the cache alloc himself. That's one example of
> an uncooperating app?

OK.

> >
> >>Yes, a programmable interface would be useful, but only for a limited
> >>set of workloads. I don't think it's how most people are going to want
> >>to use this hardware technology.
> >
> >It seems the syscall interface handles all use cases which the cgroup
> >interface handles.
> >
> >>--
> >>Matt Fleming, Intel Open Source Technology Center
> >
> >Tentative interface, please comment.
>
> Please discuss the interface details once we are solid on the kind
> of interface itself, since we have already reviewed one interface and
> are now talking about a new one. Otherwise it may miss a lot of hardware
> requirements like 1.4 above - without that we can't have a complete
> interface?
>
> I understand the cgroup interface has things like hierarchy which are
> of not much use to the intel_rdt cgroup? - is that the key issue
> here, or is the whole 'system management of the cache allocation' the
> issue?

There are several issues now -- can't say what the key issue is.

