Re: [PATCH 5/9] x86/intel_rdt: Add new cgroup and Class of service management

From: Vikas Shivappa
Date: Thu Aug 06 2015 - 16:46:11 EST

Next message: Kamal Mostafa: "[PATCH 3.13.y-ckt 30/53] dm btree: silence lockdep lock inversion in dm_btree_del()"
Previous message: Kamal Mostafa: "[PATCH 3.13.y-ckt 32/53] s390/sclp: clear upper register halves in _sclp_print_early"
In reply to: Marcelo Tosatti: "Re: [PATCH 5/9] x86/intel_rdt: Add new cgroup and Class of service management"
Next in thread: Marcelo Tosatti: "Re: [PATCH 5/9] x86/intel_rdt: Add new cgroup and Class of service management"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

On Wed, 5 Aug 2015, Marcelo Tosatti wrote:

On Wed, Aug 05, 2015 at 01:22:57PM +0100, Matt Fleming wrote:

On Sun, 02 Aug, at 12:31:57PM, Tejun Heo wrote:

But we're doing it the wrong way around. You can do most of what
cgroup interface can do with systemcall-like interface with some
inconvenience. The other way doesn't really work. As I wrote in the
other reply, cgroups is a horrible programmable interface and we don't
want individual applications to interact with it directly and CAT's
use cases most definitely include each application programming its own
cache mask.

I wager that this assertion is wrong. Having individual applications
program their own cache mask is not going to be the most common
scenario.

What i like about the syscall interface is that it moves the knowledge
of cache behaviour close to the application launching (or inside it),
which allows the following common scenario, say on a multi purpose
desktop:

Event: launch high performance application: use cache reservation, finish
quickly.
Event: cache hog application: do not thrash the cache.

The two cache reservations are logically unrelated in terms of
configuration, and configured separately do not affect each other.

There could be several issues with letting apps allocate the cache themselves. We cannot treat cache allocation just like memory allocation. Please consider the scenarios below:

All examples assume cache size: 10MB, cbm max bits: 10.

(1) user programmable syscall:

1.1> Exclusive access: A task cannot give *itself* exclusive access to the cache. For that it would need visibility into the cache allocations of other tasks, and it may need to reclaim or override others' cache allocations, which is not feasible (isn't that the job of a system managing agent?).

eg:
app1 ... app10 each ask for 1MB of exclusive cache.
They get it, as there was 10MB.

But now a large portion of the tasks on the system end up without any cache? That is not possible.
Or do they share a common pool or a default shared pool? If there is such a default pool, then it needs to be *managed*, and it reduces the amount of exclusive cache that can be given out.
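
To make the exhaustion in 1.1 concrete, here is a minimal user-space sketch in C. It assumes the 10-bit mask from the examples (1MB per bit) and a first-come, first-served grant of contiguous exclusive bits. The names `excl_pool` and `excl_alloc` are hypothetical and not part of any patch in this thread.

```c
#include <stdint.h>

#define CBM_BITS 10u /* assumption from the examples: 10-bit mask, 1MB per bit */

/* Hypothetical bookkeeping: bits already handed out exclusively. */
struct excl_pool {
    uint32_t used;
};

/*
 * Try to grant nbits contiguous exclusive bits, lowest position first.
 * Returns the granted mask, or 0 when no free run of that size is left.
 */
static uint32_t excl_alloc(struct excl_pool *p, unsigned nbits)
{
    if (nbits == 0 || nbits > CBM_BITS)
        return 0;

    uint32_t want = (1u << nbits) - 1u;
    for (unsigned shift = 0; shift + nbits <= CBM_BITS; shift++) {
        uint32_t mask = want << shift;
        if ((p->used & mask) == 0) {
            p->used |= mask;
            return mask;
        }
    }
    return 0;
}
```

With an empty pool, ten 1-bit requests succeed and the eleventh returns 0, which is the "rest of the system gets nothing" case unless some bits are held back for a managed shared pool.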

1.2> Noisy neighbour problem: how does the task itself decide that it is the noisy neighbour? This is the key requirement the feature wants to address. We want to address the jitter and inconsistencies in the quality of service, things like the response times the apps get. If you read the SDM, it is mentioned clearly there as well. Can the task voluntarily declare itself a noisy neighbour (how?) and relinquish its cache allocation (how much?)? Even then, that is not guaranteed.
How can we expect every application coder to know what system the app is going to run on and what the optimal amount of cache for the app is? It is not like memory allocation; see 1.3 and 1.4 below.

1.3> Cache allocation cannot be treated like memory allocation.
There are system-call alternatives for memory allocation apart from cgroups like cpuset, but the whole point is that you can't treat memory allocation and cache allocation as the same.
1.3.1> Memory is a very large pool, in terms of GBs, and here we are talking about only a few MBs (~10 - 20), orders of magnitude less. So we could easily get into the situation mentioned above, where the first few apps get all the exclusive cache and the rest have to starve.
1.3.2> Memory is virtualized: each process has its own address space and we are not even bound by the physical memory capacity. An app can indeed ask for more memory than is physically present, along with other apps doing the same. We can't do the same with cache allocation. Even if we evict the cache, that defeats the purpose of allocating cache to threads.

1.4> Specific h/w requirements: With code data prioritization (cdp), the h/w requires the OS to reset all the capacity bitmasks once we change mode to or from legacy cache alloc. So naturally we need to remove the tasks along with all their allocations. We cannot easily take away cache allocations that users think are theirs because they allocated them using the syscall. It is as if a task's malloc succeeded and midway the allocation is no longer there.
This also adds to the argument that cache allocation needs to be treated differently from other resource allocations like memory.

1.5> In cloud and container environments, say we need to allocate cache for an entire VM which runs a specific real_time workload vs. allocate cache for VMs which run, say, a noisy_workload. How can we achieve this by letting each app decide how much cache needs to be allocated? This is best done by an external system manager.

(2) cgroup interface:

(2.1) compare with the above usage

1.1> and 1.2> above can easily be done with the cgroup interface.
The key difference is system management versus process-self management of the cache allocation. With a centralized system manager this works fine.

The administrator can make sure that certain tasks or groups of tasks get exclusive cache blocks. The administrator can also identify the noisy neighbour application or workload using cache monitoring and make allocations appropriately.

A classic use case is described here:
http://www.intel.com/content/www/us/en/communications/cache-allocation-technology-white-paper.html

$ cd /sys/fs/cgroup/rdt
$ cd group1
$ /bin/echo 0xf > intel_rdt.l3_cbm

$ cd group2
$ /bin/echo 0xf0 > intel_rdt.l3_cbm

If we want to prevent the system admin from accidentally allocating overlapping masks, the interface could easily be extended with an always-exclusive flag.
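
As a rough illustration of what such an always-exclusive check would have to do, here is a small sketch in C. It is only a model of the mask arithmetic; `cbm_overlaps` and `cbm_exclusive_ok` are hypothetical helper names, and nothing here is taken from the actual patch.

```c
#include <stdbool.h>
#include <stdint.h>

/* Two capacity bitmasks overlap when they share at least one set bit. */
static bool cbm_overlaps(uint32_t a, uint32_t b)
{
    return (a & b) != 0;
}

/*
 * Hypothetical "always-exclusive" check: accept new_cbm only if it shares
 * no bits with any mask already assigned to another group.
 */
static bool cbm_exclusive_ok(const uint32_t *assigned, unsigned n,
                             uint32_t new_cbm)
{
    for (unsigned i = 0; i < n; i++)
        if (cbm_overlaps(assigned[i], new_cbm))
            return false;
    return true;
}
```

For the masks used above, 0xf (group1) and 0xf0 (group2) do not overlap, so both writes would pass. A later write of 0x1f0 to a third group would be rejected because it shares bits with group2.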

Rounding off: We can easily write a batch file that calculates the chunk size, displays it, and then allocates based on byte size. This can easily be done on top of this interface.

Assign tasks to group2:
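
The chunk-size arithmetic is simple enough to sketch. The C helpers below are an assumption about how such a tool might do the calculation (chunk size = cache size / number of CBM bits, byte requests rounded up to whole chunks). The function names are made up for this example.

```c
#include <stdint.h>

/* Bytes of cache represented by one CBM bit (the "chunk size").
 * Assumes cbm_len is nonzero. */
static uint64_t cbm_chunk_bytes(uint64_t cache_bytes, unsigned cbm_len)
{
    return cache_bytes / cbm_len;
}

/* Number of CBM bits needed to cover `bytes`, rounded up to whole chunks
 * and capped at cbm_len (a request larger than the cache gets all bits). */
static unsigned cbm_bits_for_bytes(uint64_t bytes, uint64_t cache_bytes,
                                   unsigned cbm_len)
{
    uint64_t chunk = cbm_chunk_bytes(cache_bytes, cbm_len);
    uint64_t bits = (bytes + chunk - 1) / chunk;

    return bits > cbm_len ? cbm_len : (unsigned)bits;
}

/* Contiguous mask of nbits bits starting at bit `first`.
 * Assumes nbits < 32, which holds for the mask lengths discussed here. */
static uint32_t cbm_mask(unsigned nbits, unsigned first)
{
    return ((1u << nbits) - 1u) << first;
}
```

With the 10MB / 10-bit example, a 4MB request needs 4 bits, so `cbm_mask(4, 0)` gives 0xf and `cbm_mask(4, 4)` gives 0xf0, the two values echoed into intel_rdt.l3_cbm above.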

$ /bin/echo PID1 > tasks
$ /bin/echo PID2 > tasks

If a bunch of threads belonging to a process (Processidx) need to be allocated cache:
$ /bin/echo <Processidx> > cgroup.procs

1.4> above can possibly be addressed in cgroup, but it would need some support which we are planning to send. One way to address it is to tear down the subsystem by deleting all the existing cgroup directories and then handling the reset. CDP then starts fresh with all bitmasks ready to be allocated.

(2.2) cpu affinity :

Similarly, the rdt cgroup can be used to assign affinity to the entire cgroup itself.
You could always use taskset as well!

example2: The commands below allocate '1MB L3 cache on socket1 to group1'
and '2MB of L3 cache on socket2 to group2'.
This mounts both cpuset and intel_rdt, and hence ls lists the
files of both subsystems.
$ mount -t cgroup -ocpuset,intel_rdt cpuset,intel_rdt rdt/
$ ls /sys/fs/cgroup/rdt
cpuset.cpus
cpuset.mems
...
intel_rdt.l3_cbm
tasks

Assign the cache
$ /bin/echo 0xf > /sys/fs/cgroup/rdt/group1/intel_rdt.l3_cbm
$ /bin/echo 0xff > /sys/fs/cgroup/rdt/group2/intel_rdt.l3_cbm

Assign tasks for group1 and group2
$ /bin/echo PID1 > /sys/fs/cgroup/rdt/group1/tasks
$ /bin/echo PID2 > /sys/fs/cgroup/rdt/group1/tasks
$ /bin/echo PID3 > /sys/fs/cgroup/rdt/group2/tasks
$ /bin/echo PID4 > /sys/fs/cgroup/rdt/group2/tasks

Tie the group1 to socket1 and group2 to socket2
$ /bin/echo <cpumask for socket1> > /sys/fs/cgroup/rdt/group1/cpuset.cpus
$ /bin/echo <cpumask for socket2> > /sys/fs/cgroup/rdt/group2/cpuset.cpus

They should be configured separately.

Also, data/code reservation is specific to the application, so its
specification should be close to the application (it's just
cumbersome to maintain that data somewhere else).

Only in very specific situations would you trust an
application to do that.

Perhaps ulimit can be used to allow a certain limit on applications.

The ulimit is very subjective and depends on the workloads, the amount of cache space available, the total cache etc. See, here you are moving towards a controlling agent which could configure ulimit to control what apps get.

A much more likely use case is having the sysadmin carve up the cache
for a workload which may include multiple, uncooperating applications.

Sorry, what does cooperating mean in this context?

See example 1.2 above: a noisy neighbour can't be expected to relinquish the cache alloc himself. That's one example of an uncooperating app?

Yes, a programmable interface would be useful, but only for a limited
set of workloads. I don't think it's how most people are going to want
to use this hardware technology.

It seems syscall interface handles all usecases which the cgroup
interface handles.

--
Matt Fleming, Intel Open Source Technology Center

Tentative interface, please comment.

Please discuss the interface details once we are solid on the kind of interface itself, since we have already reviewed one interface and are now talking about a new one. Otherwise it may miss a lot of hardware requirements like 1.4 above. Without that, can we have a complete interface?

I understand the cgroup interface has things like hierarchy which are not of much use to the intel_rdt cgroup. Is that the key issue here, or is the whole 'system management of the cache allocation' the issue?

Thanks,
Vikas

The "return key/use key" scheme would allow COSid sharing similarly to
shmget. Intra-application, that is functional, but i am not experienced
with shmget to judge whether there is a better alternative. Would have
to think how cross-application setup would work,
and in the simple "cacheset" configuration.
Also, the interface should work for other architectures (TODO item, PPC
at least has similar functionality).

enum cache_rsvt_flags {
CACHE_RSVT_ROUND_UP = (1 << 0), /* round "bytes" up */
CACHE_RSVT_ROUND_DOWN = (1 << 1), /* round "bytes" down */
CACHE_RSVT_EXTAGENTS = (1 << 2), /* allow usage of area common with external agents */
};

enum cache_rsvt_type {
CACHE_RSVT_TYPE_CODE = 0, /* cache reservation is for code */
CACHE_RSVT_TYPE_DATA, /* cache reservation is for data */
CACHE_RSVT_TYPE_BOTH, /* cache reservation is for code and data */
};

struct cache_reservation {
size_t kbytes;
u32 type;
u32 flags;
};

int sys_cache_reservation(struct cache_reservation *cv);

returns -ENOMEM if not enough space, -EPERM if no permission.
returns keyid > 0 if reservation has been successful, copying actual
number of kbytes reserved to "kbytes".
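
The return contract described above can be modelled in user space. The sketch below is not the proposed syscall. It is a toy model with several assumptions: a hypothetical `cache_model` tracks free space in KB, rounding defaults to "up" unless CACHE_RSVT_ROUND_DOWN is set, a request that rounds down to zero is treated as -ENOMEM, and -EPERM is not modelled. `uint32_t` stands in for the kernel's `u32`.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

enum cache_rsvt_flags {
    CACHE_RSVT_ROUND_UP   = (1 << 0), /* round "kbytes" up */
    CACHE_RSVT_ROUND_DOWN = (1 << 1), /* round "kbytes" down */
    CACHE_RSVT_EXTAGENTS  = (1 << 2), /* unused in this model */
};

struct cache_reservation {
    size_t kbytes;
    uint32_t type;
    uint32_t flags;
};

/* Hypothetical model state: free cache and allocation granularity, in KB. */
struct cache_model {
    size_t free_kb;
    size_t chunk_kb;
    int next_key;
};

/* User-space model of the reservation call: returns a key > 0 on success
 * and writes the amount actually reserved back to cv->kbytes. */
static int model_cache_reservation(struct cache_model *m,
                                   struct cache_reservation *cv)
{
    size_t chunks = cv->kbytes / m->chunk_kb;

    /* Assumption: round up unless ROUND_DOWN is requested. */
    if (!(cv->flags & CACHE_RSVT_ROUND_DOWN) && cv->kbytes % m->chunk_kb)
        chunks++;

    size_t granted = chunks * m->chunk_kb;
    if (granted == 0 || granted > m->free_kb)
        return -ENOMEM;

    m->free_kb -= granted;
    cv->kbytes = granted;
    return ++m->next_key;
}
```

With 10240KB free and 1024KB chunks, a 1500KB request comes back as 2048KB when rounding up and as 1024KB when rounding down. Once the free space is smaller than the rounded request, the model returns -ENOMEM and leaves the state untouched.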

-----------------

int sys_use_cache_reservation_key(struct cache_reservation *cv, int
key);

returns -EPERM if no permission.
returns -EINVAL if no such key exists.
returns 0 if instantiation of reservation has been successful,
copying actual reservation to cv.

Backward compatibility for processors with no support for code/data
differentiation: by default code and data cache allocation types
fall back to CACHE_RSVT_TYPE_BOTH on older processors (and return the
information that they have done so via "flags").
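
A possible shape for that fallback, sketched in C. The proposal does not say which flag bit reports the downgrade, so `CACHE_RSVT_FELL_BACK` below is a made-up bit used only for illustration, and `resolve_rsvt_type` is a hypothetical helper rather than part of the interface.

```c
#include <stdbool.h>
#include <stdint.h>

enum cache_rsvt_type {
    CACHE_RSVT_TYPE_CODE = 0, /* cache reservation is for code */
    CACHE_RSVT_TYPE_DATA,     /* cache reservation is for data */
    CACHE_RSVT_TYPE_BOTH,     /* cache reservation is for code and data */
};

/* Hypothetical flag bit, not in the proposal: set when the type was downgraded. */
#define CACHE_RSVT_FELL_BACK (1u << 3)

/*
 * Pick the reservation type to use, given whether the CPU supports
 * code/data differentiation. On older CPUs, CODE and DATA requests become
 * BOTH and the downgrade is reported through *flags.
 */
static uint32_t resolve_rsvt_type(uint32_t requested, bool cpu_has_cdp,
                                  uint32_t *flags)
{
    if (!cpu_has_cdp && requested != CACHE_RSVT_TYPE_BOTH) {
        *flags |= CACHE_RSVT_FELL_BACK;
        return CACHE_RSVT_TYPE_BOTH;
    }
    return requested;
}
```

A caller that asked for CACHE_RSVT_TYPE_CODE on a CPU without the feature would get CACHE_RSVT_TYPE_BOTH back and could test the flag to see that the request was widened.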

