Re: [PATCH 5/9] x86/intel_rdt: Add new cgroup and Class of service management

From: Vikas Shivappa
Date: Sun Aug 23 2015 - 14:47:40 EST




On Fri, 21 Aug 2015, Marcelo Tosatti wrote:

On Thu, Aug 20, 2015 at 05:06:51PM -0700, Vikas Shivappa wrote:


On Mon, 17 Aug 2015, Marcelo Tosatti wrote:

Vikas, Tejun,

This is an updated interface. It addresses all comments made
so far and also covers all use-cases the cgroup interface
covers.

Let me know what you think. I'll proceed to writing
the test applications.

Usage model:
------------

This document details how CAT technology is
exposed to userspace.

Each task has a list of task cache reservation entries (TCRE list).

The init process is created with an empty TCRE list.

There is a system-wide unique ID space, each TCRE is assigned
an ID from this space. IDs can be reused (but no two TCREs
have the same ID at one time).
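As a rough illustration of the ID rule above (unique while in use, reusable after release), here is a minimal sketch of a bitmap allocator. MAX_TCRE, tcre_id_alloc() and tcre_id_free() are hypothetical names chosen for this example; the proposal does not specify how IDs are handed out.

```c
#include <stdint.h>

#define MAX_TCRE 64 /* hypothetical limit, not taken from the proposal */

static uint64_t tcre_in_use; /* bit i set => ID i is currently assigned */

/* Return the lowest free ID, or -1 if every ID is taken. */
int tcre_id_alloc(void)
{
    for (int id = 0; id < MAX_TCRE; id++) {
        if (!(tcre_in_use & (UINT64_C(1) << id))) {
            tcre_in_use |= UINT64_C(1) << id;
            return id;
        }
    }
    return -1;
}

/* Release an ID so that a later reservation can reuse it. */
void tcre_id_free(int id)
{
    if (id >= 0 && id < MAX_TCRE)
        tcre_in_use &= ~(UINT64_C(1) << id);
}
```

No two live TCREs can share an ID because a bit is only handed out while it is clear, and a freed ID becomes available again on the next allocation.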

The interface accommodates transient and independent cache allocation
adjustments from applications, as well as static cache partitioning
schemes.

Allocation:
Usage of the system calls requires CAP_SYS_CACHE_RESERVATION capability.

A configurable percentage is reserved for tasks with an empty TCRE list.

Hi Vikas,

And how do you think you will do this without a system controlled
mechanism?
Every time, your proposal includes these caveats, which actually
imply a system controlled interface in the background, and your
interfaces below make no mention of this! Why do we want to confuse
ourselves like this?
A syscall only interface does not seem to work on its own for the
cache allocation scenario. It can only be a nice to have interface
on top of a system controlled mechanism like the cgroup interface. Sure,
you can do everything you did with cgroup using the syscall interface
as well, but the point is: what are the use cases that can't
be done with this syscall only interface? (ex: to deal with cases
you brought up earlier, like when an app does cache intensive work
for some time and later changes - it could use the syscall interface
to quickly relinquish the cache lines or change a clos associated
with it)

All use cases can be covered with the syscall interface.

* How to convert from cgroups interface to syscall interface:
Cgroup: Partition cache in cgroups, add tasks to cgroups.
Syscall: Partition cache in TCRE, add TCREs to tasks.

You build the same structure (task <--> CBM) either via syscall
or via cgroups.

Please be more specific, can't really see any problem.

Well, at first you mentioned that the cgroup does not support specifying size in bytes and percentage, and then you eventually agreed to my explanation that you can easily write a bash script to do the same with cgroup bitmasks (although I had to go through the pain of reading all the proposals you sent without being given a chance to explain how it can be used). Then you had a confusion about how I explained the co-mounting of cpuset and intel_rdt, and instead of asking a question or pointing out an issue, you went ahead and wrote a whole proposal, and in the end even said you would cook a patch before I could even try to explain it to you.
And then you sent proposal after proposal, which varied from modifying the cgroup interface itself, to slightly modifying cgroups and adding syscalls, to automatically controlling the cache alloc (with all your extend mask capabilities), without understanding what the framework is meant to do, or just asking, or specifically pointing out any issues in the patch. You had been reviewing the cgroup patches for many versions, unlike others who accepted they need time to think about it or accepted that they may not understand the feature yet.
So what is it that changed in the patches that is not acceptable now? Many things have been brought up multiple times even after you agreed to a solution already proposed. I was only suggesting that this can be better and less confusing if you point out the exact issue in the patch, just like Thomas and all of the reviewers have been doing. With the rest of the reviewers I either fix the issue or point out a flaw in the review.
If you don't like the cgroup interface now, it would be best to indicate or discuss the specifics of the shortcomings clearly before sending new proposals. That way we can come up with an interface which does better and works better in Linux, if we can. Otherwise we may just end up adding more code which does the same thing?

However, I have been working on an alternate interface as well and have just sent it for your reference.


I have repeatedly listed the use cases that can be dealt with using
this interface. How will you address cases like 1.1 and 1.2 with
your syscall only interface?

Case 1.1:
--------

1.1> Exclusive access: The task cannot give *itself* exclusive
access to the cache. For this it needs to have visibility of
the cache allocation of other tasks and may need to reclaim or
override others' cache allocs, which is not feasible (isn't that the
ability of a system managing agent?).

Answer: if the application has CAP_SYS_CACHE_RESERVATION, it can
create cache allocation and remove cache allocation from
other applications. So only the administrator could do it.

1.1 also includes another use case (let's call this 1.1.1), which indicates that the apps would just allocate a lot of cache and soon run out of space. Hence the first few apps would get most of the cache (they would get *most* even if you reserve some % of cache for others - and again that's difficult to assign to the others).

Now if you say you want to put a threshold limit on what each app can self allocate, then that turns out to be an interface that can easily be built on top of the existing cgroup interface. IOW it's just a control you are giving the app on top of an existing admin controlled interface (like cgroup). The threshold can just be the cbm of the cgroup which the task belongs to, so now the apps can self allocate or reduce the allocation to something which is a subset of what the cgroup has (that's one way..)
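
To make the "subset of what the cgroup has" idea concrete, a minimal sketch of the check could look like the following. The function name cbm_within_limit() is hypothetical and is not part of the patches; it only illustrates treating the cgroup's cbm as the maximum threshold for a self-allocating task.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Return true if the mask a task asks for stays within its cgroup's mask.
 * An empty mask is rejected here on the assumption that a cbm must have
 * at least one bit set.
 */
bool cbm_within_limit(uint32_t task_cbm, uint32_t cgroup_cbm)
{
    return task_cbm != 0 && (task_cbm & ~cgroup_cbm) == 0;
}
```

With a cgroup cbm of 0x0f, a task asking for 0x0c would be accepted, while 0x18 would be refused because bit 4 lies outside the cgroup's allocation.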

Also the issue was to discuss self allocation, i.e. a process deciding its own allocation, vs. a system controlled mechanism. It wasn't clear which of the syscalls need this capability and which ones do not.


Case 1.2 answer below.

So we expect all the millions of apps
like SAP, Oracle etc. and all the millions of app developers
to magically learn our new syscall interface and also cooperate
between themselves to decide a cache allocation that is agreeable to
all? (which btw the interface below doesn't describe how to do) and

They don't have to: the administrator can use "cacheset" application.

The "cacheset" wasn't mentioned before. Now you are talking about a tool which is also doing centralized or system controlled allocation. This is where I pointed out earlier that it's best to keep the discussion to the point and not randomly expand the scope to a variety of other options. If you want to build a taskset like tool, that's again just a system controlled interface or a centralized control mechanism, which is what cgroup does. Then it just comes down to whether the cgroup interface or cacheset is easier or more intuitive. And why would the already widely used interface for resource allocation not be intuitive? We first need to answer that, maybe? Or are there any really required features it lacks?
Also, given that Docker uses cgroups for resource allocation, it seems the best fit, and that's the feedback I received repeatedly at LinuxCon as well.


If an application wants to control the cache, it can.

then by some godly powers the noisy neighbour will decide himself
to give up the cache?

I suppose you imagine something like this:
http://arxiv.org/pdf/1410.6513.pdf

No, the syscall interface does not need to care about that because:

* If you can set cache (CAP_SYS_CACHE_RESERVATION capability),
you can remove cache reservation from your neighbours.

So this problem does not exist (it assumes participants are
cooperative).

There is one confusion in the argument for cases 1.1 and case 1.2:
that applications are supposed to include in their decision of cache
allocation size the status of the system as a whole. This is a flawed
argument. Please point specifically if this is not the case or if there
is another case still not covered.

Like I said, it wasn't clear which syscalls required this capability. Also 1.1.1 still breaks this, or IOW the apps need to have less control than a system/admin controlled allocation.


It would be possible to partition the cache into watermarks such
as:

task group A - can reserve up to 20% of cache.
task group B - can reserve up to 25% of cache.
task group C - can reserve 50% of cache.

But i am not sure... Tejun, do you think that is necessary?
(CAP_SYS_CACHE_RESERVATION is good enough for our usecases).

(that would be the first ever app in the world to not request
more resources for itself and hurt its own performance
- they surely don't want to do social service!)

And how do we do case 1.5, where the administrator wants to assign
cache to specific VMs in a cloud etc - with the hypothetical syscall
interface we now should expect all the apps to do the above, and now
they also need to know where they run (what VM, what socket etc)
and then decide and cooperate on an allocation: compare this to a
container environment like rancher, where today the admin can
conveniently use docker underneath to allocate mem/storage/compute to
containers and easily extend this to include shared l3.

http://marc.info/?l=linux-kernel&m=143889397419199

Without addressing the above, the details of the interface below are irrelevant -

You are missing the point, there is supposed to be a "cacheset"
program which will allow the admin to setup TCRE and assign them to
tasks.

Your initial request was to extend the cgroup interface to include
rounding off the size of cache (which can easily be done with a bash
script on top of cgroup interface !) and now you are proposing a
syscall only interface ? this is very confusing and will only
unnecessarily delay the process without adding any value.

I suppose you are assuming that its necessary for applications to
set their own cache. This assumption is not correct.

Take a look at Tuna / sched_getaffinity:

https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_MRG/1.3/html/Realtime_Reference_Guide/chap-Realtime_Reference_Guide-Affinity.html


However, like I mentioned, the syscall interface, or the user/app being
able to modify the cache alloc, could be used to address some very
specific use cases on top of an existing system managed interface. This
is not really a common case in a cloud or container environment, nor
is it a feasible deployable solution.
Just consider the millions of apps that have to transition to such
an interface to even use it - if that's the only way to do it, that's
dead on arrival.

Applications should not rely on interfaces that are not upstream.

Is there an explicit request or comment from users about
their difficulty regarding a change in the interface?

However, there also needs to be reasoning on why the cgroup interface is not good.


Also please do not include the kernel automatically adjusting resources
in your reply, as that's totally irrelevant and again more confusing,
as we have already exchanged some >100 emails on this same patch
version without getting anywhere so far.

The debate is purely between a syscall only interface and a system
manageable interface (like cgroup, where the admin or a central entity
controls the resources). If not, define what it is first before going
into details.

See the Tuna / taskset page.
The administrator could, for example, use "cacheset" from within
the scripts which initialize the applications.
Then having control over those scripts, he can view them as a "unified
system control interface".

Problems with cgroup interface:

1) Global IPI on CBM <---> task change does not scale.

Don't understand this. How is the IPI related to cgroups? A task is associated with one closid and it needs to carry that along wherever it goes. It supports the use cases I explain here (mainly cloud/container and server use cases):

http://marc.info/?l=linux-kernel&m=144035279828805

2) Syscall interface specification is in kbytes, not
cache ways (which is what must be recorded by the OS
to allow migration of the OS between different
hardware systems).

I thought you agreed that a simple bash script can convert the bitmask to bytes in units of chunk size. All you need is the cache size from /proc/cpuinfo and the max cbm bits in the root intel_rdt cgroup. And it's incorrect to say you can do it in bytes. It's only chunk size really (chunk size = cache size / max cbm bits).
Apart from that, the mask gives you the ability to choose exclusive, overlapping, or partially overlapping and partially exclusive masks.
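
The conversion described above can be sketched in a few lines of C. This is only an illustration of the arithmetic (chunk size = cache size / max cbm bits); the function names are hypothetical, and the cache size and max cbm bits are assumed to have been read by the caller from /proc/cpuinfo and the root intel_rdt cgroup.

```c
#include <stdint.h>

/* Chunk size in KB: cache size divided by the number of cbm bits. */
uint32_t chunk_kbytes(uint32_t cache_kbytes, unsigned int max_cbm_bits)
{
    return max_cbm_bits ? cache_kbytes / max_cbm_bits : 0;
}

/* Count the bits set in a mask. */
static unsigned int count_bits(uint32_t mask)
{
    unsigned int n = 0;

    while (mask) {
        mask &= mask - 1; /* clear the lowest set bit */
        n++;
    }
    return n;
}

/* Cache covered by a cbm, in KB (always a multiple of the chunk size). */
uint32_t cbm_to_kbytes(uint32_t cbm, uint32_t cache_kbytes,
                       unsigned int max_cbm_bits)
{
    return count_bits(cbm) * chunk_kbytes(cache_kbytes, max_cbm_bits);
}

/* Cache shared between two cbms, in KB; 0 means they are exclusive. */
uint32_t cbm_shared_kbytes(uint32_t a, uint32_t b, uint32_t cache_kbytes,
                           unsigned int max_cbm_bits)
{
    return cbm_to_kbytes(a & b, cache_kbytes, max_cbm_bits);
}
```

For example, with a 20480 KB cache and 20 cbm bits the chunk size is 1024 KB, a cbm of 0x0f covers 4096 KB, and masks 0x0f and 0x3c share 2048 KB (partially overlapping, partially exclusive).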

3) Compilers are able to configure cache optimally for
given ranges of code inside applications, easily,
if desired.

This is again not possible because of 1.1.1. It can still be done in a restricted fashion like I explained above.

4) Does not allow proper usage of shared caches between
applications. Think of the following scenario:
* AppA has threads which are created/destroyed,
but once initialized, want cache reservation.
* How is AppA going to coordinate with cgroups
system to initialize/shut down cgroups?


Yes, the interface does not support apps self controlling their cache alloc. That is accepted. But this is not the main use case we target, like I explained above and in the link I provided for the new proposal and before. So it's not very important as such.
Also, worst case, you can easily design a syscall for apps to self control, keeping the cgroup alloc for the task as the max threshold.
So let's nail down this list (of cgroup flaws you list) before thinking about changes? This should have been the first thing in the email, really, is what I was mentioning.

I started writing the syscall interface on top of your latest
patchset yesterday (it should be relatively easy, given
that most of the low-level code is already there).

Any news on the data/code separation ?

Will send them this week, partially untested due to the h/w not yet being with me. They have been ready, but I was waiting to see the discussions on this patch as well.

more response below -



Thanks,
Vikas


On fork, the child inherits the TCR from its parent.

Semantics:
Once a TCRE is created and assigned to a task, that task has
guaranteed reservation on any CPU where it is scheduled in,
for the lifetime of the TCRE.

A task can have its TCR list modified without notification.

Why does the task need a list of allocations? A task is tagged with only one closid and it needs to carry that along. Even if the list is for each socket, that needs to be an array.


FIXME: Add a per-task flag to not copy the TCR list of a task but delete
all TCR's on fork.

Interface:

enum cache_rsvt_flags {
CACHE_RSVT_ROUND_DOWN = (1 << 0), /* round "kbytes" down */
};

Not really optional, is it? The chunk size is decided by the h/w SKU and you can only allocate in that chunk size, not any number of bytes.


enum cache_rsvt_type {
CACHE_RSVT_TYPE_CODE = 0, /* cache reservation is for code */
CACHE_RSVT_TYPE_DATA, /* cache reservation is for data */
CACHE_RSVT_TYPE_BOTH, /* cache reservation is for code and data */
};

struct cache_reservation {
unsigned long kbytes;

Should be rounded off to chunk size really. And like I explained above, the masks let you easily do exclusive, or partially exclusive with an adjustable percentage (say 20% shared and the rest exclusive), or a tolerated amount of shared...

int type;
int flags;
int tcrid;
};
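
Since the hardware only allocates in whole chunks, a requested "kbytes" value has to be rounded one way or the other. The sketch below shows one way that could work. It assumes that, when CACHE_RSVT_ROUND_DOWN is not set, the request is rounded up; the proposal does not state the default, and round_to_chunks() is a hypothetical helper.

```c
#include <stdint.h>

#define CACHE_RSVT_ROUND_DOWN (1 << 0) /* same value as in the enum above */

/*
 * Round a request in KB to a whole number of chunks.
 * Assumption: without CACHE_RSVT_ROUND_DOWN the request is rounded up.
 */
uint32_t round_to_chunks(uint32_t kbytes, uint32_t chunk_kb, int flags)
{
    uint32_t chunks;

    if (chunk_kb == 0)
        return 0;

    chunks = kbytes / chunk_kb;
    if (!(flags & CACHE_RSVT_ROUND_DOWN) && kbytes % chunk_kb)
        chunks++; /* partial chunk requested: take the whole chunk */

    return chunks * chunk_kb;
}
```

With a 1024 KB chunk, a 2500 KB request becomes 3072 KB by default and 2048 KB with CACHE_RSVT_ROUND_DOWN; this is the "actual number of kbytes reserved" that a caller would see copied back.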

The following syscalls modify the TCR of a task:

* int sys_create_cache_reservation(struct cache_reservation *rsvt);
DESCRIPTION: Creates a cache reservation entry, and assigns
it to the current task.

So now I assume this is what the task can do itself, and the ones below which take a pid need the capability? Again this breaks 1.1.1 like I said above, and any way to restrict to a threshold max alloc can easily be done on top of cgroup alloc, keeping the cgroup alloc as the max threshold.


returns -ENOMEM if not enough space, -EPERM if no permission.
returns 0 if reservation has been successful, copying actual
number of kbytes reserved to "kbytes", type to type, and tcrid.

* int sys_delete_cache_reservation(struct cache_reservation *rsvt);
DESCRIPTION: Deletes a cache reservation entry, deassigning it
from any task.

Backward compatibility for processors with no support for code/data
differentiation: by default code and data cache allocation types
fallback to CACHE_RSVT_TYPE_BOTH on older processors (and return the
information that they have done so via "flags").

Need to address the change of mode, which is dynamic, and it may be more intuitive to do that in cgroups for the reasons I gave above. Taking an allocation back from a process may need a callback; that's why it may be best to design an interface where the apps know their control is very limited and within the purview of the allocations already set by the root user.

Please check the new proposal, which tries to address most of the comments I made -
http://marc.info/?l=linux-kernel&m=144035279828805
The framework still lets any kernel mode or high level user mode library developer build a cacheset like tool or others on top of it if that needs to be more custom and more intuitive.

Thanks,
Vikas


* int sys_attach_cache_reservation(pid_t pid, unsigned int tcrid);
DESCRIPTION: Attaches cache reservation identified by "tcrid" to
the task identified by pid.
returns 0 if successful.

* int sys_detach_cache_reservation(pid_t pid, unsigned int tcrid);
DESCRIPTION: Detaches cache reservation identified by "tcrid" from
the task identified by pid.

The following syscalls list the TCRs:
* int sys_get_cache_reservations(size_t size, struct cache_reservation list[]);
DESCRIPTION: Return all cache reservations in the system.
Size should be set to the maximum number of items that can be stored
in the buffer pointed to by list.

* int sys_get_tcrid_tasks(unsigned int tcrid, size_t size, pid_t list[]);
DESCRIPTION: Return which pids are associated to tcrid.

* sys_get_pid_cache_reservations(pid_t pid, size_t size,
struct cache_reservation list[]);
DESCRIPTION: Return all cache reservations associated with "pid".
Size should be set to the maximum number of items that can be stored
in the buffer pointed to by list.

* sys_get_cache_reservation_info()
DESCRIPTION: ioctl to retrieve hardware info: cache round size, whether
code/data separation is supported.


