Re: [PATCH 0/7] sched/deadline: fix cpusets bandwidth accounting

From: Luca Abeni
Date: Tue Aug 22 2017 - 08:21:55 EST


Hi Mathieu,

On Wed, 16 Aug 2017 15:20:36 -0600
Mathieu Poirier <mathieu.poirier@xxxxxxxxxx> wrote:

> This is a renewed attempt at fixing a problem reported by Steve Rostedt [1]
> where DL bandwidth accounting is not recomputed after CPUset and CPUhotplug
> operations. When CPUhotplug and some CPUset manipulations take place, root
> domains are destroyed and new ones created, losing at the same time the DL
> accounting pertaining to utilisation.

Thanks for looking at this longstanding issue! I am just back from
vacation; in the coming days I'll try your patches.
Do you have some kind of script for reproducing the issue
automatically? (I see that in the original email Steven described how
to reproduce it manually; I just wonder if anyone has already scripted
the test.)

> An earlier attempt by Juri [2] used the scheduling classes' rq_online() and
> rq_offline() methods, something that highlighted a problem with sleeping
> DL tasks. The email thread that followed envisioned creating a list of
> sleeping tasks to circle through when recomputing DL accounting.
>
> In this set the problem is addressed by relying on the list of tasks
> (sleeping or not) already maintained by each CPUset. When CPUset or
> CPUhotplug operations have completed, we circle through the list of tasks
> maintained by each CPUset looking for DL tasks. When a DL task is found,
> its utilisation is added to the root domain it pertains to by way of its
> runqueue.
>
> The advantage of proceeding this way is that recomputing of DL accounting
> is done the same way for both active and inactive tasks, along with
> guaranteeing that DL accounting for tasks ends up in the correct root
> domain regardless of the CPUset topology. The disadvantage is that
> circling through all the tasks in a CPUset can be time consuming. The
> counterargument is that both CPUset and CPUhotplug operations are time
> consuming in the first place.

I do not know the cpuset code very well, but I agree that your approach
looks better than creating an additional list for blocked deadline
tasks.
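
Just to check that I am understanding the mechanism correctly, I guess
the recomputation step looks roughly like the sketch below (the function
name is made up and the locking is simplified; I assume the real patches
use the existing dl_bw accounting helpers):

	/* Illustrative sketch only; name and locking are simplified. */
	static void rebuild_dl_accounting(struct cpuset *cs)
	{
		struct css_task_iter it;
		struct task_struct *task;

		css_task_iter_start(&cs->css, &it);
		while ((task = css_task_iter_next(&it))) {
			struct rq *rq;

			if (!dl_task(task))
				continue;

			rq = task_rq(task);
			raw_spin_lock(&rq->rd->dl_bw.lock);
			/* Add the task's bandwidth to its root domain. */
			rq->rd->dl_bw.total_bw += task->dl.dl_bw;
			raw_spin_unlock(&rq->rd->dl_bw.lock);
		}
		css_task_iter_end(&it);
	}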


> OPEN ISSUE:
>
> Regardless of how we proceed (using the existing CPUset lists or new ones),
> we need to deal with DL tasks that span more than one root domain,
> something that will typically happen after a CPUset operation. For
> example, if we split the available CPUs on a system into two CPUsets and
> then turn off the 'sched_load_balance' flag on the parent CPUset, DL tasks
> in the parent CPUset will end up spanning two root domains.
>
> One way to deal with this is to prevent CPUset operations from happening
> when such condition is detected, as enacted in this set.

I think this is the simplest (if not only?) solution if we want to use
gEDF in each root domain.
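
I imagine something like the following check at CPUset-change time
(purely illustrative: task_spans_multiple_rds() is a made-up name for
whatever test the set actually performs):

	/*
	 * Illustrative only: refuse the operation if a DL task would
	 * end up spanning more than one root domain.
	 */
	static int dl_validate_cpuset_change(struct cpuset *cs)
	{
		struct css_task_iter it;
		struct task_struct *task;
		int ret = 0;

		css_task_iter_start(&cs->css, &it);
		while ((task = css_task_iter_next(&it))) {
			/* task_spans_multiple_rds() is hypothetical */
			if (dl_task(task) && task_spans_multiple_rds(task)) {
				ret = -EBUSY;
				break;
			}
		}
		css_task_iter_end(&it);
		return ret;
	}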

> Although simple,
> this approach feels brittle and akin to a "whack-a-mole" game. A better
> and more reliable approach would be to teach the DL scheduler to deal with
> tasks that span multiple root domains, a serious and substantial
> undertaking.
>
> I am sending this as a starting point for discussion. I would be grateful
> if you could take the time to comment on the approach and most importantly
> provide input on how to deal with the open issue underlined above.

I suspect that if we want to guarantee bounded tardiness then we have to
go for a solution similar to the one suggested by Tommaso some time ago
(if I remember correctly):

if we want to create some "second level cpusets" inside a "parent
cpuset", allowing deadline tasks to be placed in both the "parent
cpuset" and the "second level cpusets", then we have to subtract the
maximum utilizations of the "second level cpusets" from the "parent
cpuset" utilization.

I am not sure how difficult this would be to implement...
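
To fix the idea, the admission rule could look something like this
sketch (all the helper names here are made up):

	/*
	 * Sketch only: cpuset_total_dl_bw() and cpuset_max_dl_bw() are
	 * hypothetical helpers. The idea is that the bandwidth reserved
	 * for the "second level cpusets" is no longer available to DL
	 * tasks placed directly in the "parent cpuset".
	 */
	static u64 parent_available_dl_bw(struct cpuset *parent)
	{
		struct cgroup_subsys_state *pos;
		struct cpuset *child;
		u64 avail = cpuset_total_dl_bw(parent);

		rcu_read_lock();
		cpuset_for_each_child(child, pos, parent)
			avail -= cpuset_max_dl_bw(child);
		rcu_read_unlock();

		return avail;
	}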


If, instead, we want to guarantee that all the deadlines are respected,
then we need to have a look at Brandenburg's paper on arbitrary
affinities:
https://people.mpi-sws.org/~bbb/papers/pdf/rtsj14.pdf


Thanks,
Luca