[RFC v2 1/8] sched/tune: add detailed documentation

From: Patrick Bellasi
Date: Thu Oct 27 2016 - 13:41:27 EST


The topic of a single simple power-performance tunable, that is wholly
scheduler centric, and has well defined and predictable properties has
come up on several occasions in the past. With techniques such as a
scheduler driven DVFS, which is now provided mainline via the schedutil
governor, we now have a good framework for implementing such a tunable.

This patch provides a detailed description of the motivations and design
decisions behind the implementation of SchedTune.

Cc: Jonathan Corbet <corbet@xxxxxxx>
Cc: linux-doc@xxxxxxxxxxxxxxx
Signed-off-by: Patrick Bellasi <patrick.bellasi@xxxxxxx>
---
Documentation/scheduler/sched-tune.txt | 392 +++++++++++++++++++++++++++++++++
1 file changed, 392 insertions(+)
create mode 100644 Documentation/scheduler/sched-tune.txt

diff --git a/Documentation/scheduler/sched-tune.txt b/Documentation/scheduler/sched-tune.txt
new file mode 100644
index 0000000..da7b3eb
--- /dev/null
+++ b/Documentation/scheduler/sched-tune.txt
@@ -0,0 +1,392 @@
+ SchedTune Support for CFS Tasks
+ central, scheduler-driven, power-performance control
+ (EXPERIMENTAL)
+
+Abstract
+========
+
+The topic of a single simple power-performance tunable, that is wholly
+scheduler centric and has well defined and predictable properties, has come up
+on several occasions in the past [1,2]. With techniques such as a scheduler
+driven DVFS [3], we now have a good framework for implementing such a tunable.
+
+Scheduler driven DVFS provides the foundation mechanism on top of which it's
+possible to differentiate the performance levels based on the specific
+requirements of different use-cases. For example, on mobile systems it's likely
+to have demanding use-cases which may benefit from running on an higher OPP
+than the one currently chosen by schedutil.
+
+In the past this behavior was provided by changing the governor or by tuning
+its the many different parameters of non-mainline governors, now we can achieve
+similar behaviors by tuning schedutil. This approach allows also for a more
+fine grained control which can be extended to consider the requirement of
+specific tasks.
+
+This document introduces SchedTune and describes the overall ideas behind its
+design and implementation.
+
+Table of Contents
+=================
+
+1. Motivation
+2. Introduction
+3. Signal Boosting Strategy
+4. OPP selection using boosted CPU utilization
+5. Per task group boosting
+6. Question and Answers
+ - What about "auto" mode?
+ - What about boosting on a congested system?
+ - How CPUs are boosted when we have tasks with multiple boost values?
+7. References
+
+
+1. Motivation
+=============
+
+schedutil [3] is a new event-driven cpufreq governor which allows the scheduler
+to select the optimal DVFS Operating Performance Point (OPP) for running a task
+allocated to a CPU. The introduction of schedutil enables running workloads at
+the most efficient OPPs.
+
+However, sometimes it may be desired to intentionally boost the performance of
+a workload even if that could imply a reasonable increase in energy
+consumption. For example, in order to reduce the response time of a task, we
+may want to run the task at a higher OPP than the one actually required by its
+CPU bandwidth demand.
+
+This last requirement is especially important if we consider that schedutil can
+potentially replace all currently available CPUFreq policies. Since schedutil
+is event based, as opposed to the sampling driven governors, it is already more
+responsive at selecting the optimal OPP to run tasks allocated to a CPU.
+However, just tracking the actual task utilization demand may not be enough
+from a performance standpoint. For example, it is not possible to get
+behaviors similar to those provided by the "performance" and "powersave"
+CPUFreq governors.
+
+This document describes an implementation of a tunable, stacked on top of
+schedutil, which extends its functionality to support task performance
+boosting.
+
+By "performance boosting" we mean the reduction of the time required to
+complete a task activation, i.e. the time elapsed from a task wakeup to its
+next deactivation (e.g. because it goes back to sleep or it terminates).
+For example, if we consider a simple periodic task which executes the same
+workload for 5[ms] every 20[ms] while running at a certain OPP, a boosted
+execution of that task should be able to complete each of its activations in
+less than 5[ms].
+
+Previous attempts to introduce such a boosting feature has not been successful,
+mainly because of the complexity of the proposed solution. The approach
+described in this document exposes a single simple interface to user-space.
+This single knob allows the tuning of system wide scheduler behaviours ranging
+from energy efficiency at one end through to incremental performance boosting
+at the other end. This tunable affects all tasks. A more advanced extension of
+this concept is also provided, which uses CGroups to boost the performance of
+selected tasks while using the default schedutil behaviors for all others.
+
+The rest of this document introduces in more details the proposed solution
+which has been named SchedTune.
+
+
+2. Introduction
+===============
+
+SchedTune exposes a simple user-space interface with a single power-performance
+tunable:
+
+ /proc/sys/kernel/sched_cfs_boost
+
+This permits expressing a boost value as an integer in the range [0..100].
+
+A value of 0 (default) for a CFS task means that schedutil will attempt
+to match compute capacity of the CPU where the task is scheduled to
+match its current utilization with a few spare cycles left. A value of
+100 means that schedutil will select the highest available OPP.
+
+The range between 0 and 100 can be set to satisfy other scenarios suitably.
+For example to satisfy interactive response or depending on other system events
+(battery level, thermal status, etc).
+
+A CGroup based extension is also provided, which permits further user-space
+defined task classification to tune the scheduler for different goals depending
+on the specific nature of the task, e.g. background vs interactive vs
+low-priority.
+
+The overall design of the SchedTune module is built on top of "Per-Entity Load
+Tracking" (PELT) signals and schedutil by introducing a bias on the Operating
+Performance Point (OPP) selection. Each time a task is allocated on a CPU,
+schedutil has the opportunity to tune the operating frequency of that CPU to
+better match the workload demand. The selection of the actual OPP being
+activated is influenced by the global boost value, or the boost value for the
+task CGroup when in use.
+
+This simple biasing approach leverages existing frameworks, which means minimal
+modifications to the scheduler, and yet it allows to achieve a range of
+different behaviours all from a single simple tunable knob. The only new
+concept introduced is that of signal boosting.
+
+
+3. Signal Boosting Strategy
+===========================
+
+The whole PELT machinery works based on the value of a few utilization tracking
+signals which basically track the CPU bandwidth requirements for tasks and the
+capacity of CPUs. The basic idea behind the SchedTune knob is to artificially
+inflate some of these utilization tracking signals to make a task or RQ appears
+more demanding than it actually is.
+
+Which signals have to be inflated depends on the specific "consumer". However,
+independently from the specific (signal, consumer) pair, it is important to
+define a simple and possibly consistent strategy for the concept of boosting a
+signal.
+
+A boosting strategy defines how the "abstract" user-space defined
+sched_cfs_boost value is translated into an internal "margin" value to be added
+to a signal to get its inflated value:
+
+ margin := boosting_strategy(sched_cfs_boost, signal)
+ boosted_signal := signal + margin
+
+Different boosting strategies were identified and analyzed before selecting the
+one found to be most effective. The next section describes the details of this
+boosting strategy.
+
+Signal Proportional Compensation (SPC)
+--------------------------------------
+
+In this boosting strategy the sched_cfs_boost value is used to compute a margin
+which is proportional to the complement of the original signal. This
+complement is defined as the delta from the actual value of a signal and its
+possible maximum value.
+
+Since the tunable implementation uses signals which have SCHED_CAPACITY_SCALE
+as the maximum possible value, the margin becomes:
+
+ margin := sched_cfs_boost * (SCHED_CAPACITY_SCALE - signal)
+
+Using this boosting strategy:
+- a 100% sched_cfs_boost means that the signal is scaled to the maximum value
+- each value in the range of sched_cfs_boost effectively inflates the signal in
+ question by a quantity which is proportional to the maximum value.
+
+For example, by applying the SPC boosting strategy to the selection of the OPP
+to run a task it is possible to achieve these behaviors:
+
+- 0% boosting: run the task at the minimum OPP required by its workload
+- 100% boosting: run the task at the maximum OPP available for the CPU
+- 50% boosting: run at the half-way OPP between minimum and maximum
+
+Which means that, at 50% boosting, a task will be scheduled to run at half of
+the maximum theoretically achievable performance on the specific target
+platform.
+
+For example, assuming a 50% task which runs for 8ms every 16ms,
+ignoring any margins built into schedutil we should get:
+
+ Boost OPP Expected Completion Time
+ 0% 50% 16.00 ms
+ 50% 75% 16*50/75 = 10.67 ms
+ 100% 100% 8.00 ms
+
+The reduction of the completion time of boosting 100% instead of 50% is
+only 25%.
+
+A graphical representation of an SPC boosted signal is represented in the
+following figure where:
+ a) "-" represents the original signal
+ b) "b" represents a 50% boosted signal
+ c) "p" represents a 100% boosted signal
+
+
+ | SCHED_CAPACITY_SCALE
+ +-----------------------------------------------------------------+
+ |pppppppppppppppppppppppppppppppppppppppppppppppppppppppppppppppppp
+ |
+ | boosted_signal
+ | bbbbbbbbbbbbbbbbbbbbbbbb
+ |
+ | original signal
+ | bbbbbbbbbbbbbbbbbbbbbbbb+----------------------+
+ | |
+ |bbbbbbbbbbbbbbbbbb |
+ | |
+ | |
+ | |
+ | +-----------------------+
+ | |
+ | |
+ | |
+ |------------------+
+ |
+ |
+ +----------------------------------------------------------------------->
+
+The plot above shows a ramped utilization signal (titled 'original_signal') and
+its boosted equivalent. For each step of the original signal the boosted signal
+corresponding to a 50% boost is midway from the original signal and the upper
+bound. Boosting by 100% generates a boosted signal which is always saturated to
+the upper bound.
+
+
+4. OPP selection using boosted CPU utilization
+==============================================
+
+It is worth calling out that the implementation does not introduce any new
+utilization signals. Instead, it provides an API to tune existing signals. This
+tuning is done on demand and only in scheduler code paths where it is sensible
+to do so. The new API calls are defined to return either the default signal or
+a boosted one, depending on the value of sched_cfs_boost. This is a clean and
+non invasive modification of the existing existing code paths.
+
+The signal representing a CPU's utilization is boosted according to the
+previously described SPC boosting strategy. To schedutil, this allows a CPU
+(ie CFS run-queue) to appear more used then it actually is.
+
+Thus, with the sched_cfs_boost enabled we have the following main functions to
+get the current utilization of a CPU:
+
+ cpu_util()
+ boosted_cpu_util()
+
+The new boosted_cpu_util() is similar to the first but returns a boosted
+utilization signal which is a function of the sched_cfs_boost value.
+
+This function is used in the CFS scheduler code paths where schedutil needs to
+decide the OPP to run a CPU at.
+For example, this allows selecting the highest OPP for a CPU which has
+the boost value set to 100%.
+
+
+5. Per task group boosting
+==========================
+
+The availability of a single knob which is used to boost all tasks in the
+system is certainly a simple solution but it quite likely doesn't fit many
+use-case, especially in the mobile device space.
+
+For example, on battery powered devices there usually are many background
+services which are long running and need energy efficient scheduling. On the
+other hand, some applications are more performance sensitive and require an
+interactive response and/or maximum performance, regardless of the energy cost.
+To better service such scenarios, the SchedTune implementation has an extension
+that provides a more fine grained boosting interface.
+
+A new CGroup controller, namely "schedtune", could be enabled which allows to
+defined and configure task groups with different boosting values. Tasks that
+require special performance can be put into separate CGroups. The value of the
+boost associated with the tasks in this group can be specified using a single
+knob exposed by the CGroup controller:
+
+ schedtune.boost
+
+This knob allows the definition of a boost value that is to be used for
+SPC boosting of all tasks attached to this group.
+
+The current schedtune controller implementation is really simple and has these
+main characteristics:
+
+ 1) It is only possible to create 1 level depth hierarchies
+
+ The root control groups define the system-wide boost value to be applied
+ by default to all tasks. Its direct subgroups are named "boost groups" and
+ they define the boost value for specific set of tasks.
+ Further nested subgroups are not allowed since they do not have a sensible
+ meaning from a user-space standpoint.
+
+ 2) It is possible to define only a limited number of "boost groups"
+
+ This number is defined at compile time and by default configured to 16.
+ This is a design decision motivated by two main reasons:
+ a) In a real system we do not expect use-cases with more then few
+ boost groups. For example, a reasonable collection of groups could be
+ just "background", "interactive" and "performance".
+ b) It simplifies the implementation considerably, especially for the code
+ which has to compute the per CPU boosting once there are multiple
+ RUNNABLE tasks with different boost values.
+
+Such a simple design should allow servicing the main utilization scenarios
+identified so far. It provides a simple interface which can be used to manage
+the power-performance of all tasks or only selected tasks. Moreover, this
+interface can be easily integrated by user-space run-times (e.g. Android,
+ChromeOS) to implement a QoS solution for task boosting based on tasks
+classification, which has been a long standing requirement.
+
+Setup and usage
+---------------
+
+0. Use a kernel with CGROUP_SCHED_TUNE support enabled
+
+1. Check that the "schedtune" CGroup controller is available:
+
+ root@derkdell:~# cat /proc/cgroups
+ #subsys_name hierarchy num_cgroups enabled
+ cpuset 0 1 1
+ cpu 0 1 1
+ schedtune 0 1 1
+
+2. Mount a tmpfs to create the CGroups mount point (Optional)
+
+ root@derkdell:~# sudo mount -t tmpfs cgroups /sys/fs/cgroup
+
+3. Mount the "schedtune" controller
+
+ root@derkdell:~# mkdir /sys/fs/cgroup/stune
+ root@derkdell:~# sudo mount -t cgroup -o schedtune stune /sys/fs/cgroup/stune
+
+4. Setup the system-wide boost value (Optional)
+
+ If not configured the root control group has a 0% boost value, which
+ basically disables boosting for all tasks in the system thus running in
+ an energy-efficient mode. Let assume SYSBOOST defines the default boost
+ value to be used for all tasks:
+
+ root@derkdell:~# echo $SYSBOOST > /sys/fs/cgroup/stune/schedtune.boost
+
+5. Create task groups and configure their specific boost value (Optional)
+
+ For example here we create a "performance" boost group configure to boost
+ all its tasks to 100%
+
+ root@derkdell:~# mkdir /sys/fs/cgroup/stune/performance
+ root@derkdell:~# echo 100 > /sys/fs/cgroup/stune/performance/schedtune.boost
+
+6. Move tasks into the boost group
+
+ For example, the following moves the tasks with PID $TASKPID (and all its
+ tasks) into the "performance" boost group.
+
+ root@derkdell:~# echo "TASKPID > /sys/fs/cgroup/stune/performance/cgroup.procs
+
+This simple configuration allows only the tasks of the $TASKPID task to run,
+when needed, at the highest OPP in the most capable CPU of the system.
+
+
+6. Question and Answers
+=======================
+
+What about "auto" mode?
+-----------------------
+
+The 'auto' mode as described in previous propose approaches can be implemented
+by interfacing SchedTune with some suitable user-space element. This element
+could use the exposed system-wide or cgroup based interface.
+
+How are multiple groups of tasks with different boost values managed?
+---------------------------------------------------------------------
+
+The current SchedTune implementation keeps track of the boosted RUNNABLE tasks
+on a CPU. Once schedutil selects the OPP for a CPU, the utilization of that CPU
+is boosted with a value which is the maximum of the boost values of all the
+currently RUNNABLE tasks in that CPU.
+
+This allows schedutil to boost a CPU only while there are boosted tasks ready
+to run and switch back to the energy efficient mode as soon as the last boosted
+task is dequeued.
+
+
+7. References
+=============
+[1] http://lwn.net/Articles/552889
+[2] http://lkml.org/lkml/2012/5/18/91
+[3] http://lkml.org/lkml/2016/3/16/559
+
--
2.10.1