[PATCH] Revert "cpufreq: schedutil: Don't set next_freq to UINT_MAX"

From: Doug Smythies
Date: Thu Jul 18 2019 - 02:27:30 EST


This reverts commit ecd2884291261e3fddbc7651ee11a20d596bb514.

The commit caused a regression whereby reducing the maximum
CPU clock frequency is ineffective while busy, and the CPU
clock remains unchanged. Once the system has experienced
some idle time, the new limit is implemented.
A consequence is that any thermal throttling monitoring
and control based on max freq limits fail to respond
in a timely manor, if at all, to a thermal temperature
trip on a busy system.

Please review the original e-mail threads, as this
attempt at a revertion would then re-open some of
those questions. One possible alterative to this
revertion would be to revert B7EAF1AAB9F8:

Author: Rafael J. Wysocki <rafael.j.wysocki@xxxxxxxxx>
Date: Wed Mar 22 00:08:50 2017 +0100

cpufreq: schedutil: Avoid reducing frequency of busy CPUs prematurely

References:

https://marc.info/?t=152576189300002&r=1&w=2
https://marc.info/?t=152586219000002&r=1&w=2
https://marc.info/?t=152607169800004&r=1&w=2

https://marc.info/?t=149013813900002&r=1&w=2
---
Test Data:

Typical commands:
watch -n 5 cat /sys/devices/system/cpu/intel_pstate/max_perf_pct
watch -n 5 cat /sys/devices/system/cpu/cpufreq/policy*/scaling_max_freq
sudo turbostat --quiet --Summary --show Busy%,Bzy_MHz,PkgTmp,PkgWatt,IRQ --interval 5
watch -n 5 cat /sys/devices/system/cpu/cpufreq/policy*/scaling_cur_freq

Driver: intel_cpufreq ; Governor: Schedutil
Kernel 5.2:

Test 1: No thermal throttling involved:
- load the system (some sort of CPU stress test).
- adjust downwards the maximum clock frequency
echo 60 | sudo tee /sys/devices/system/cpu/intel_pstate/max_perf_pct
- observe the Bzy_MHz does not change on the turbostat output.
- reduce the system load, such that there is some idle time.
- observe the Bzy_MHz now does change.
- increase the load again.
- observe the CPU frequency is now pinned to the new limit.

Test 2: Use thermald to try Thermal throttling:
- By default thermald will use pstate control first.
On my system a lower thermal trip point it used,
because it doesn't get hot enough.
- enable and start thermald
- load the system (some sort of CPU stress test)
- Once the thermal trip point is reached, observe thermald
adjusting the max_perf_pct but nothing happens in response.
- Once the lowest limit of max_perf_pct is reached, observe
thermald fall through to intel-powerclamp, and kidle inject
starts to be used. Therefore the "busy" condition (see the code)
is not longer met, and the CPU frequency suddenly goes to minimum.

Test 3: Limit thermald to using max_perf_pct only:
- Adjust the cpu cooling device order to just include the one
option, thereby excluding help from the fall through partner
experienced in test 2.
- Observations here tracked test 2 until the lower limit
of max_perf_pct is reached. The CPUs remain at maximum
frequency until the load is removed. Processor
temperature continues to climb.

Driver: intel_cpufreq ; Governor: Schedutil
Kernel 5.2 + this revert patch:

Test 1: No thermal throttling involved:
- load the system (some sort of CPU stress test)
- adjust downwards the maximum clock frequency
echo 60 | sudo tee /sys/devices/system/cpu/intel_pstate/max_perf_pct
- observe the Bzy_MHz changes on the turbostat output.

Test 2: Use thermald to try Thermal throttling:
- By default thermald will use pstate control first.
- enable and start thermald
- load the system (some sort of CPU stress test)
- Once the thermal trip point is reached, observe thermald
adjusting the max_perf_pct and observe the actual frequency
adjust in response.

Test 3: Limit thermald to using max_perf_pct only:
- Adjust the cpu cooling device order to just include the one
option, adjusting max_perf_pct.
- Observations here tracked test 2. (test 2 never fell
through to intel_powerclamp and kidle inject.

Driver: acpi-cpufreq ; Governor: Schedutil
Kernel 5.2:

Test 1: No thermal throttling involved:
- load the system (some sort of CPU stress test)
- adjust downwards the maximum clock frequency
for file in /sys/devices/system/cpu/cpufreq/policy*/scaling_max_freq; do echo "2200000" > $file; done
- observe the Bzy_MHz does not change on the turbostat output.
- reduce the system load, such that there is some idle time.
- observe the Bzy_MHz now does change.
- increase the load again.
- observe the CPU frequency is now pinned to the new limit.

Test 2: Use thermald to try Thermal throttling:
- By default thermald will use intel_powerclamp control first.
- enable and start thermald
- load the system (some sort of CPU stress test)
- Once the thermal trip point is reached, observe thermald
adjusting the kidle_inj amounts.
- If the maximum kidle_inj is not enough, then oberve thermald
start to limit scaling_max_freq, however the actual frequency
response is immediate. Why? Because of the kidle inject, the
system is not longer fully "busy", and so that condition is not met.

Test 3: Limit thermald to using scaling_max_freq only:
- Adjust the cpu cooling device order to just include the one
option, thereby excluding help from previous partner
experienced in test 2.
- load the system (some sort of CPU stress test)
- Once the thermal trip point is reached, observe thermald
reducing scaling_max_freq.
- observe the Bzy_MHz does not change on the turbostat output.
- Eventually, for all CPUs, scaling_max_freq is set to the minimum.
- Processor temperature continues to climb.
- The CPUs remain at maximum frequency until the load is removed.

Driver: acpi-cpufreq ; Governor: Schedutil
Kernel 5.2 + this revert patch:

Test 1: No thermal throttling involved:
- load the system (some sort of CPU stress test)
- adjust downwards the maximum clock frequency
for file in /sys/devices/system/cpu/cpufreq/policy*/scaling_max_freq; do echo "2200000" > $file; done
- observe the Bzy_MHz changes on the turbostat output.

Test 2: Use thermald to try Thermal throttling:
- By default thermald will use intel_powerclamp control first.
- enable and start thermald
- This test already passes using kernel 5.2, and
proper operation was observed with this revert.

Test 3: Limit thermald to using scaling_max_freq only:
- Adjust the cpu cooling device order to just include the one
option, thereby excluding help from previous partner
experienced in test 2.
- load the system (some sort of CPU stress test)
- Once the thermal trip point is reached, observe thermald
reducing scaling_max_freq.
- observe the Bzy_MHz now does change on the turbostat output
and the processor package temperature is now controlled.
---
kernel/sched/cpufreq_schedutil.c | 18 ++++++++++++------
1 file changed, 12 insertions(+), 6 deletions(-)

diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
index 962cf34..ea4e04b 100644
--- a/kernel/sched/cpufreq_schedutil.c
+++ b/kernel/sched/cpufreq_schedutil.c
@@ -89,8 +89,15 @@ static bool sugov_should_update_freq(struct sugov_policy *sg_policy, u64 time)
!cpufreq_this_cpu_can_update(sg_policy->policy))
return false;

- if (unlikely(sg_policy->need_freq_update))
+ if (unlikely(sg_policy->need_freq_update)) {
+ sg_policy->need_freq_update = false;
+ /*
+ * This happens when limits change, so forget the previous
+ * next_freq value and force an update.
+ */
+ sg_policy->next_freq = UINT_MAX;
return true;
+ }

delta_ns = time - sg_policy->last_freq_update_time;

@@ -168,10 +175,8 @@ static unsigned int get_next_freq(struct sugov_policy *sg_policy,

freq = map_util_freq(util, freq, max);

- if (freq == sg_policy->cached_raw_freq && !sg_policy->need_freq_update)
+ if (freq == sg_policy->cached_raw_freq && sg_policy->next_freq != UINT_MAX)
return sg_policy->next_freq;
-
- sg_policy->need_freq_update = false;
sg_policy->cached_raw_freq = freq;
return cpufreq_driver_resolve_freq(policy, freq);
}
@@ -457,7 +462,8 @@ static void sugov_update_single(struct update_util_data *hook, u64 time,
* Do not reduce the frequency if the CPU has not been idle
* recently, as the reduction is likely to be premature then.
*/
- if (busy && next_f < sg_policy->next_freq) {
+ if (busy && next_f < sg_policy->next_freq &&
+ sg_policy->next_freq != UINT_MAX) {
next_f = sg_policy->next_freq;

/* Reset cached freq as next_freq has changed */
@@ -819,7 +825,7 @@ static int sugov_start(struct cpufreq_policy *policy)

sg_policy->freq_update_delay_ns = sg_policy->tunables->rate_limit_us * NSEC_PER_USEC;
sg_policy->last_freq_update_time = 0;
- sg_policy->next_freq = 0;
+ sg_policy->next_freq = UINT_MAX;
sg_policy->work_in_progress = false;
sg_policy->need_freq_update = false;
sg_policy->cached_raw_freq = 0;
--
2.7.4