Re: [RFC 5/8] Track the "total rq utilisation" too

From: Luca Abeni
Date: Fri Jan 29 2016 - 16:21:34 EST


Hi Peter,

On Fri, 29 Jan 2016 16:06:05 +0100
Peter Zijlstra <peterz@xxxxxxxxxxxxx> wrote:

> On Fri, Jan 15, 2016 at 10:15:11AM +0100, Luca Abeni wrote:
>
> > There is also a newer paper, which will be published at ACM SAC
> > 2016 (so it is not available yet) but is based on this technical
> > report: http://arxiv.org/abs/1512.01984
> > This second paper describes some more complex algorithms (easily
> > implementable over this patchset) that are able to guarantee hard
> > schedulability for SCHED_DEADLINE tasks with reclaiming on SMP.
>
> So I finally got around to reading the relevant sections of that paper
> (5.1 and 5.2).
>
> The paper introduces two alternatives:
>
> - parallel reclaim (5.1)
> - sequential reclaim (5.2)
>
> The parent patch introduces the accounting required for sequential
> reclaiming IIUC.
The patches I posted implement something similar to sequential
reclaiming (they actually implement the algorithm described in the
RTLWS paper: http://disi.unitn.it/~abeni/reclaiming/rtlws14-grub.pdf).

Both the parallel and the sequential reclaiming algorithms can be
implemented on top of the RFC patches I posted: see
https://github.com/lucabe72/linux-reclaiming/commits/reclaiming-new-v3
starting from the commit "Move to the new M-GRUB definition (BCL
incarnation)", which implements sequential reclaiming; parallel
reclaiming is implemented in the following two commits.
I did not post these last patches because I feel they are still too
rough even for an RFC :)

> Thinking about things, however, I think I would prefer parallel
> reclaim
> over sequential reclaim. The problem I see with sequential reclaim is
> that under light load jobs might land on different CPUs and not
> benefit from reclaim (as much) since the 'spare' bandwidth is stuck
> on other CPUs.
>
> Now I suppose the exact conditions to hit that worst case might be
> quite hard to trigger, in which case it might just not matter in
> practical terms.
>
> But maybe I'm mistaken, the paper doesn't seem to compare the two
> approaches in this way.
The technical report does not present any comparison, but the ACM SAC
paper (still to be published) presents some experiments comparing the
two algorithms. And you are right: parallel reclaiming seems to work
better.

However, parallel reclaiming requires a "global" (per scheduling
domain) variable to keep track of the total active (or inactive)
utilization, and this variable is updated every time a task blocks or
unblocks. So, I expected more overhead, or scalability issues... but
in my experiments I have not been able to measure this overhead (I
"only" tested on a dual Xeon, with 4 cores per CPU... Maybe more
runqueues/CPU cores are needed before the scalability issues show
up?).
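
Just to make the idea concrete, the accounting I have in mind looks
more or less like this (a rough sketch: the structure and function
names are invented for this example and do not match the actual code
in the patches):

struct dl_domain_bw {
	raw_spinlock_t lock;	/* serializes updates from all CPUs */
	u64 total_running_bw;	/* sum of dl_bw of the "active" tasks */
};

static void add_running_bw(struct dl_domain_bw *dom, u64 dl_bw)
{
	unsigned long flags;

	/* Task wakes up / becomes active in the domain */
	raw_spin_lock_irqsave(&dom->lock, flags);
	dom->total_running_bw += dl_bw;
	raw_spin_unlock_irqrestore(&dom->lock, flags);
}

static void sub_running_bw(struct dl_domain_bw *dom, u64 dl_bw)
{
	unsigned long flags;

	/* Task blocks / becomes inactive in the domain */
	raw_spin_lock_irqsave(&dom->lock, flags);
	dom->total_running_bw -= dl_bw;
	raw_spin_unlock_irqrestore(&dom->lock, flags);
}

Every blocking or unblocking task on every CPU of the domain has to
take the same lock, which is where I expected the scalability issues
to show up.
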
Also, the implementation of parallel reclaiming is an approximation:
for example, when a task wakes up on a runqueue, the "global" active
utilization is updated, but before updating it we should properly
account the runtime consumed by the SCHED_DEADLINE tasks executing on
all the other runqueues of the scheduling domain. Instead, I only
account the runtime when a tick fires or when a task blocks. Still,
my experiments were not able to show any performance degradation
caused by this approximation.
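
For reference, the reclaiming rule itself is simple: when a task
executes for a time dt, its runtime is decreased by Uact * dt instead
of dt, so the task is allowed to execute longer when the active
utilization is low. In fixed point, it is something like this (again,
just a sketch with invented names):

#define BW_SHIFT	20
#define BW_UNIT		(1 << BW_SHIFT)	/* fixed-point "1.0" utilization */

/*
 * GRUB depletion rule: dq = -Uact * dt, where "running_bw" is the
 * active utilization (total_running_bw in the sketch above) in
 * BW_UNIT units. Plain CBS accounting would simply be dq = -dt.
 */
static u64 grub_reclaim(u64 delta_exec, u64 running_bw)
{
	return (delta_exec * running_bw) >> BW_SHIFT;
}

so the accounting path does something like
"dl_se->runtime -= grub_reclaim(delta_exec, running_bw)". The
approximation described above means that running_bw can be slightly
stale when it is read from the accounting path.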

I got the impression that the "per-runqueue active utilization" can
be tracked with less overhead (and can be more useful, for example to
drive frequency scaling), and this is why I posted this patchset as
the RFC... But if you think that the "global active utilization"
(leading to parallel reclaiming) can be more useful, I can reorder
the patches and post an RFC with a parallel reclaiming implementation.
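
To show what I mean by "useful to drive frequency scaling": a
per-runqueue active utilization gives a natural frequency hint for
the local CPU without touching any shared state (an invented sketch,
not code from this patchset):

/*
 * If a runqueue-local running_bw tracks the active utilization of
 * the SCHED_DEADLINE tasks on this CPU (updated only on local
 * enqueue/dequeue, so no cross-CPU synchronization is needed), a
 * cpufreq hint is just a proportion of the maximum frequency;
 * running_bw == BW_UNIT means "a full CPU is needed".
 */
static unsigned long dl_freq_hint(u64 running_bw, unsigned long max_freq)
{
	return (max_freq * running_bw) >> BW_SHIFT;
}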



Thanks,
Luca