[PATCH v5] docs/scheduler: Introduce the docs of load average

From: Jui-Tse Huang
Date: Thu Apr 07 2022 - 07:46:28 EST


The load average is one of a common as well as easy observed statistic
provied by Linux, but still not well documented, which makes the numbers
that users observes from the output of top, htop or other system
monitoring application are only numbers. This patch gives a discussion
on how Linux calculates the load average as well as what metrics are
concerned while calculating the load average.

Topics covered in this documentation:
1. The physical meaning of load average
2. How to peek at load average as well as what values could be fetched
from /proc/loadavg interface.
3. Run through the implementation of the updating process of load
average.

Signed-off-by: Jui-Tse Huang <juitse.huang@xxxxxxxxx>
Signed-off-by: Yiwei Lin <s921975628@xxxxxxxxx>
Signed-off-by: Ching-Chun (Jim) Huang <jserv@xxxxxxxxxxxxxxxx>
Co-Developed-by: Yiwei Lin <s921975628@xxxxxxxxx>

---

Notes:
Notes:
v5:
- Reorgnize the content (Peter Zijlstra)

v4:
- Fix typo and grammar error (Randy Dunlap)

v3:
- Fix typo (Randy Dunlap)
- Add further reading that links to Brendan Gregg's blog

v2:
- Fix typo (Chun-Hung Tseng)

Documentation/scheduler/index.rst | 1 +
Documentation/scheduler/load-average.rst | 115 +++++++++++++++++++++++
2 files changed, 116 insertions(+)
create mode 100644 Documentation/scheduler/load-average.rst

diff --git a/Documentation/scheduler/index.rst b/Documentation/scheduler/index.rst
index 88900aabdbf7..bdc779b4190f 100644
--- a/Documentation/scheduler/index.rst
+++ b/Documentation/scheduler/index.rst
@@ -17,6 +17,7 @@ Linux Scheduler
sched-nice-design
sched-rt-group
sched-stats
+ load-average

text_files

diff --git a/Documentation/scheduler/load-average.rst b/Documentation/scheduler/load-average.rst
new file mode 100644
index 000000000000..a8dfcd66b20e
--- /dev/null
+++ b/Documentation/scheduler/load-average.rst
@@ -0,0 +1,115 @@
+============
+Load Average
+============
+The Load average, provided by common operating systems, indicates the average
+number of system loads over a period of time. In Linux, it shows the average
+number of tasks running and waiting for CPU time.
+
+The Physical Meaning
+--------------------
+A task is considered consuming hardware resources once it is in the TASK_RUNNING
+state as well as the TASK_UNINTERRUPTIBLE state but without being frozen (flag
+PF_FROZEN is not set) and is not marked as an idle task (TASK_NOLOAD). The
+former (TASK_RUNNING) indicates the task is consuming CPU or waiting for CPU,
+and the latter usually means a task is waiting for an I/O operation being
+completed, both of them will be covered in the load average.
+
+Two variables are covered while calculating the load average, one is the number
+of previously mentioned tasks (*active*), the other one is the load average of
+the previous time slot (*load_{t - 1}*). The weighted average of these two
+variables will then be calculated to provide the current load average of the
+system. Applying different weights to these variables controls how *active*
+affects the resulting load average (load_t). A higher weight applied to *active*
+makes *load_t* increase or decrease easily when facing a burst load or when the
+heavy tasks are complete, while a lower weight applied to *active* requires
+tasks to keep using hardware resources for a long time to affect *load_t*.
+
+Peeking at Load Average
+-----------------------
+The load average is provided in three different timescales (by applying
+different weights) and could be observed via system monitoring applications such
+as top or htop, or with the following command::
+
+ $cat /proc/loadavg
+
+The output of the command shown above has the format::
+
+ AVG_1 AVG_5 AVG_15 RUNNING/THREADS PID
+
+The first three fields are load averages:
+
+- AVG_{i}: The system load average over the last i minutes.
+
+Next field contains two numbers describing the task counts:
+
+- RUNNING: # of runable tasks on the system.
+- THREADS: # of not idle tasks on the system.
+
+The last field shows overall PID allocation:
+
+- PID: The last allocated PID.
+
+Implementation
+--------------
+The period of updating cycle is predefined as 5 seconds, which relies on the
+scheduler tick of each CPU. The scheduler tick of CPUs may be disabled with
+NO_HZ configuration (see `NO_HZ: Reducing Scheduling-Clock Ticks
+<https://www.kernel.org/doc/html/latest/timers/no_hz.html>`_ for detail), which
+may cause miss counting the *active* of those CPUs (a CPU may disable its
+scheduler tick even it is not idle) as well as miss the updating cycle of load
+average (the scheduler tick of all CPUs is disabled), thus, the situation
+should be dealt explicitly.
+
+At the beginning of each iteration, a 10-tick sample window is given to avoid
+synchronization costs between CPUs. While in the sample window, CPUs will upload
+the *active* of their run queue to a global variable `calc_load_tasks`, and the
+job is done in the function `calc_global_load_tick()`. And the *active* of the
+scheduler tick disabled CPUs is collected with the following steps:
+
+#. Before a CPU disables its scheduler tick, the function
+ `calc_load_nohz_start()` is invoked to upload its *active* even if the sample
+ window is not met yet.
+#. The *active* of the scheduler tick disabled CPU is uploaded by other CPU via
+ the function `calc_load_nohz_remote()`.
+#. The uploaded *active* is merged with *active* held by `calc_load_tasks` in
+ the function `calc_global_load()`.
+#. Once the scheduler tick of a CPU is restarted, the missed updating period is
+ synchronized in the function `calc_load_nohz_stop()`.
+
+Two counters, elements of array `calc_load_nohz[2]`, aim to hold the pending
+*active* of the scheduler tick disabled CPUs and will be swapped between cycles.
+Functions `calc_load_read_idx()` and `calc_load_write_idx()` will provide
+correct index into the array. Worth to notice that the index for writing is
+renewed once steps into a new updating cycle, which makes the uploaded *active*
+will be considered in the next cycle, while the index for reading is renewed
+after the sample window.
+
+Once the sample window ends, the function `calc_global_load()` will invoke the
+function `calc_load()` which calculates the load average base on the following
+formula::
+
+ load_{t + 1} = load_{t} * exp + active * (1 - exp)
+
+And if we expend the formula recursively, we will have following expression::
+
+ load_{t + 2} = load_{t + 1} * exp + active * (1 - exp)
+ = (load_{t} * exp + active * (1 - exp)) * exp + active * (1 - exp)
+ = load_{t} * exp^2 + active * (exp - exp^2 + 1 - exp)
+ = load_{t} * exp^2 + active * (1 - exp^2)
+ ...
+ load_{t + n} = load_{t} * exp^n + active * (1 - exp^n)
+
+Thus, if there is any updating cycle missed, the weights taken to get the load
+average should be modified as the weight to the power of n, where n is the
+number of missed cycles. The function `calc_global_nohz()`, invoked at the end
+of the function `calc_global_load()`, takes the responsibility to calculate the
+number of missed cycles, and with the function `calc_load_n()`, the modified
+weights are applied to get the result.
+
+See `kernel/sched/loadavg.c` for detail.
+
+Further Reading
+---------------
+Brendan Gregg has explained the history of load average as well as analysis over
+it in his blog post:
+https://www.brendangregg.com/blog/2017-08-08/linux-load-averages.html
--
2.25.1