[PATCH 0/7] sched/deadline: fix cpusets bandwidth accounting

From: Mathieu Poirier
Date: Wed Aug 16 2017 - 17:20:53 EST


This is a renewed attempt at fixing a problem reported by Steve Rostedt [1]
where DL bandwidth accounting is not recomputed after CPUset and CPUhotplug
operations. When CPUhotplug and some CPUset manipulations take place, root
domains are destroyed and new ones created, losing at the same time the DL
accounting pertaining to utilisation.

An earlier attempt by Juri [2] used the scheduling classes' rq_online() and
rq_offline() methods, something that highlighted a problem with sleeping
DL tasks. The email thread that followed envisioned creating a list of
sleeping tasks to iterate over when recomputing DL accounting.

In this set the problem is addressed by relying on the existing list of
tasks (sleeping or not) already maintained by CPUsets. Once CPUset or
CPUhotplug operations have completed, we iterate over the list of tasks
maintained by each CPUset looking for DL tasks. When a DL task is found,
its utilisation is added to the root domain it belongs to by way of its
runqueue.
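
The walk described above can be sketched in plain userspace C. This is
only an illustration of the idea, not the code in this series: every type
and function name below (struct cpuset, struct task, struct root_domain,
dl_ratio(), rebuild_dl_accounting()) is a simplified stand-in invented for
the example, and the 20-bit fixed-point utilisation is an assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NR_CPUS      4   /* hypothetical system size */
#define BW_FRAC_BITS 20  /* assumed fixed-point precision for utilisation */

enum policy { POLICY_OTHER, POLICY_DEADLINE };

/* Stand-in for a root domain: only the DL bandwidth total is modelled. */
struct root_domain {
	uint64_t dl_total_bw;
};

/* Stand-in for a task as seen from a cpuset's task list. */
struct task {
	enum policy policy;
	int cpu;          /* CPU whose runqueue the task is attached to */
	int sleeping;     /* deliberately ignored by the rebuild */
	uint64_t dl_bw;   /* runtime/period as a fixed-point ratio */
	struct task *next;
};

struct cpuset {
	struct task *tasks;
};

/* Utilisation of a DL task: runtime / period in fixed point. */
static uint64_t dl_ratio(uint64_t period, uint64_t runtime)
{
	return period ? (runtime << BW_FRAC_BITS) / period : 0;
}

/*
 * Recompute DL accounting from scratch: clear every root domain, then
 * walk each cpuset's task list and add each DL task's utilisation to the
 * root domain reached through the task's CPU (its runqueue).
 */
static void rebuild_dl_accounting(const struct cpuset *sets, size_t nr_sets,
				  struct root_domain **cpu_rd,
				  struct root_domain *rds, size_t nr_rds)
{
	for (size_t i = 0; i < nr_rds; i++)
		rds[i].dl_total_bw = 0;

	for (size_t i = 0; i < nr_sets; i++) {
		for (const struct task *t = sets[i].tasks; t; t = t->next) {
			if (t->policy != POLICY_DEADLINE)
				continue;
			assert(t->cpu >= 0 && t->cpu < NR_CPUS);
			/* Same path for running and sleeping tasks. */
			cpu_rd[t->cpu]->dl_total_bw += t->dl_bw;
		}
	}
}
```

Because the sleeping flag is never consulted, active and inactive tasks
take exactly the same path, which is the property the approach relies on.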

The advantage of proceeding this way is that recomputing of DL accounting
is done the same way for both active and inactive tasks, along with
guaranteeing that DL accounting for tasks ends up in the correct root
domain regardless of the CPUset topology. The disadvantage is that
iterating over all the tasks in a CPUset can be time consuming. The
counterargument is that both CPUset and CPUhotplug operations are time
consuming in the first place.

OPEN ISSUE:

Regardless of how we proceed (using the existing CPUset lists or new ones)
we need to deal with DL tasks that span more than one root domain,
something that will typically happen after a CPUset operation. For
example, if we split the available CPUs on a system into two CPUsets and
then turn off the 'sched_load_balance' flag on the parent CPUset, DL tasks
in the parent CPUset will end up spanning two root domains.
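
To make the example concrete, here is a small userspace C sketch of how
such a condition could be detected. It is illustrative only: CPU masks are
modelled as 64-bit bitmaps and spans_multiple_root_domains() is a
hypothetical helper, not a function from this series or from the kernel.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Return true when the CPUs a task is allowed to run on intersect more
 * than one root domain. Each entry of rd_spans is the CPU bitmap covered
 * by one root domain; the spans are assumed not to overlap.
 */
static bool spans_multiple_root_domains(uint64_t task_cpus,
					const uint64_t *rd_spans,
					size_t nr_rds)
{
	size_t hits = 0;

	assert(rd_spans != NULL || nr_rds == 0);

	for (size_t i = 0; i < nr_rds; i++)
		if (task_cpus & rd_spans[i])
			hits++;

	return hits > 1;
}
```

With four CPUs split into root domains covering CPUs 0-1 (0x3) and CPUs
2-3 (0xC), a task left in the parent CPUset with mask 0xF intersects both
spans, while a task confined to either child CPUset intersects only one.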

One way to deal with this is to prevent CPUset operations from happening
when such a condition is detected, as enacted in this set. Although
simple, this approach feels brittle and akin to a "whack-a-mole" game. A
better and more reliable approach would be to teach the DL scheduler to
deal with tasks that span multiple root domains, a serious and substantial
undertaking.

I am sending this as a starting point for discussion. I would be grateful
if you could take the time to comment on the approach and, most
importantly, provide input on how to deal with the open issue outlined
above.

Many thanks,
Mathieu

[1]. https://lkml.org/lkml/2016/2/3/966
[2]. https://marc.info/?l=linux-kernel&m=145493552607388&w=2

Mathieu Poirier (7):
  sched/topology: Adding function partition_sched_domains_locked()
  cpuset: Rebuild root domain deadline accounting information
  sched/deadline: Keep new DL task within root domain's boundary
  cgroup: Constrain 'sched_load_balance' flag when DL tasks are present
  cgroup: Concentrate DL related validation code in one place
  cgroup: Constrain the addition of CPUs to a new CPUset
  sched/core: Don't change the affinity of DL tasks

 include/linux/sched.h          |   3 +
 include/linux/sched/deadline.h |   8 ++
 include/linux/sched/topology.h |   9 ++
 kernel/cgroup/cpuset.c         | 186 ++++++++++++++++++++++++++++++++++++++---
 kernel/sched/core.c            |  10 +--
 kernel/sched/deadline.c        |  47 ++++++++++-
 kernel/sched/sched.h           |   3 -
 kernel/sched/topology.c        |  31 +++++--
 8 files changed, 272 insertions(+), 25 deletions(-)

--
2.7.4