[PATCH V2] idle/intel_powerclamp: Redesign idle injection to use bandwidth control mechanism

From: Preeti U Murthy
Date: Sun Feb 08 2015 - 23:50:49 EST


The powerclamp driver injects idle periods to stay within thermal constraints.
It fakes idle by spawning per-cpu threads that execute the mwait instruction.
This fake idle can confuse other kernel subsystems. For instance, the driver
calls into the nohz tick handlers, which are meant to be invoked only by the
idle task. It marks the tick on each cpu as idle and stops the tick even though
there are tasks on the runqueue. As a result, callers of idle_cpu() and
tick_nohz_tick_stopped() see inconsistent states of the cpu: the former thinks
the cpu is busy, while the latter thinks it is idle. The outcome can be an
inconsistency between the scheduler and nohz states, which can have serious
consequences. One such issue was reported on this thread:
https://lkml.org/lkml/2014/12/11/365.

Thomas posted a patch that prevents the powerclamp driver from calling into the
tick nohz code, which takes care of the above regression for the moment. As a
result, however, the powerclamp driver can no longer inject idle periods, since
the periodic tick keeps running. With the current design of fake idle we cannot
move towards a better solution.
https://lkml.org/lkml/2014/12/18/169

This patch removes the concept of fake idle and instead makes the cpus truly
idle by throttling their runqueues during the idle injection periods. The
situation is very similar to the throttling of cfs_rqs when they exceed their
bandwidth. The idle injection metrics can be mapped to the bandwidth control
metrics 'quota' and 'period' to achieve the same result. When powerclamping is
started, or when the clamping controls are modified, the bandwidth of the root
task group is set accordingly: the 'quota' is the amount of time the system
needs to be busy, and the 'period' is the sum of this busy duration and the
idle duration calculated by the driver (see the sketch below). This gets rid of
the per-cpu kthreads, the control cpu, the hotplug notifiers and the clamping
mask, since the thread that starts powerclamping sets the bandwidth and the
throttling of all cpus automatically falls into place. None of the other cpus
need to be bothered about this, which simplifies the design of the driver.
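
As an illustration (not part of the patch), the mapping from the idle injection
parameters to the bandwidth control parameters is roughly the following, where
busy_us and idle_us are illustrative names for the busy and injected-idle
durations computed by the driver:

	/* sketch only: busy_us/idle_us are placeholder names */
	quota  = busy_us;			/* time the cpus may run per window */
	period = busy_us + idle_us;		/* busy time plus injected idle time */
	tg_set_cfs_quota(&root_task_group, quota);
	tg_set_cfs_period(&root_task_group, period);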

Of course, this works only if the idle injection metrics can be conveniently
transformed into bandwidth control metrics. Beyond that, there are a couple of
primary concerns about whether the two changes below are valid (a usage sketch
follows the list):
a. This patch exports the functions that set the quota and period of task groups.
b. This patch removes the constraint that the root task group's bandwidth cannot be set.
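
For example, with (a) and (b) in place, any kernel code could clamp the whole
machine to roughly 60% busy time (the values below are purely illustrative):

	tg_set_cfs_period(&root_task_group, 100000);	/* 100 ms period, in usecs */
	tg_set_cfs_quota(&root_task_group, 60000);	/* 60 ms quota -> ~60% busy */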

Signed-off-by: Preeti U Murthy <preeti@xxxxxxxxxxxxxxxxxx>
---
V1: https://lkml.org/lkml/2014/12/21/23

V1 tried to throttle cfs_rqs without modifying the available bandwidth. That is
not an appropriate solution, since the scheduler considers a runqueue throttled
only when its bandwidth has been exceeded; working around this can lead to
unintended consequences. Hence V2 throttles runqueues through bandwidth control.

drivers/thermal/intel_powerclamp.c | 290 +++++++++---------------------------
include/linux/sched.h | 6 +
kernel/sched/core.c | 5 -
3 files changed, 83 insertions(+), 218 deletions(-)

diff --git a/drivers/thermal/intel_powerclamp.c b/drivers/thermal/intel_powerclamp.c
index 6ceebd6..f43903c 100644
--- a/drivers/thermal/intel_powerclamp.c
+++ b/drivers/thermal/intel_powerclamp.c
@@ -78,20 +78,15 @@ static struct dentry *debug_dir;
static unsigned int set_target_ratio;
static unsigned int current_ratio;
static bool should_skip;
+static unsigned int count;
static bool reduce_irq;
static atomic_t idle_wakeup_counter;
-static unsigned int control_cpu; /* The cpu assigned to collect stat and update
- * control parameters. default to BSP but BSP
- * can be offlined.
- */
static bool clamping;
+static int clamp_cpus(void);
+static void clamp_timer_fn(unsigned long foo);
+DEFINE_TIMER(wakeup_timer, clamp_timer_fn, 0, 0);

-
-static struct task_struct * __percpu *powerclamp_thread;
static struct thermal_cooling_device *cooling_dev;
-static unsigned long *cpu_clamping_mask; /* bit map for tracking per cpu
- * clamping thread
- */

static unsigned int duration;
static unsigned int pkg_cstate_ratio_cur;
@@ -163,7 +157,7 @@ static int window_size_set(const char *arg, const struct kernel_param *kp)
smp_mb();

exit_win:
-
+ clamp_cpus();
return ret;
}

@@ -256,10 +250,6 @@ static u64 pkg_state_counter(void)
return count;
}

-static void noop_timer(unsigned long foo)
-{
- /* empty... just the fact that we get the interrupt wakes us up */
-}

static unsigned int get_compensation(int ratio)
{
@@ -362,102 +352,78 @@ static bool powerclamp_adjust_controls(unsigned int target_ratio,
return set_target_ratio + guard <= current_ratio;
}

-static int clamp_thread(void *arg)
+static int clamp_cpus(void)
{
- int cpunr = (unsigned long)arg;
- DEFINE_TIMER(wakeup_timer, noop_timer, 0, 0);
- static const struct sched_param param = {
- .sched_priority = MAX_USER_RT_PRIO/2,
- };
- unsigned int count = 0;
unsigned int target_ratio;
+ u64 quota = RUNTIME_INF, period = 100000; /* defaults: no quota, 100ms period */
+ int sleeptime;
+ unsigned long target_jiffies;
+ unsigned int guard;
+ unsigned int compensation = 0;
+ int interval; /* jiffies to sleep for each attempt */
+ unsigned int duration_jiffies = msecs_to_jiffies(duration);
+ unsigned int window_size_now;

- set_bit(cpunr, cpu_clamping_mask);
- set_freezable();
- init_timer_on_stack(&wakeup_timer);
- sched_setscheduler(current, SCHED_FIFO, &param);
-
- while (true == clamping && !kthread_should_stop() &&
- cpu_online(cpunr)) {
- int sleeptime;
- unsigned long target_jiffies;
- unsigned int guard;
- unsigned int compensation = 0;
- int interval; /* jiffies to sleep for each attempt */
- unsigned int duration_jiffies = msecs_to_jiffies(duration);
- unsigned int window_size_now;
-
- try_to_freeze();
- /*
- * make sure user selected ratio does not take effect until
- * the next round. adjust target_ratio if user has changed
- * target such that we can converge quickly.
- */
- target_ratio = set_target_ratio;
- guard = 1 + target_ratio/20;
- window_size_now = window_size;
- count++;
-
- /*
- * systems may have different ability to enter package level
- * c-states, thus we need to compensate the injected idle ratio
- * to achieve the actual target reported by the HW.
- */
- compensation = get_compensation(target_ratio);
- interval = duration_jiffies*100/(target_ratio+compensation);
-
- /* align idle time */
- target_jiffies = roundup(jiffies, interval);
- sleeptime = target_jiffies - jiffies;
- if (sleeptime <= 0)
- sleeptime = 1;
- schedule_timeout_interruptible(sleeptime);
- /*
- * only elected controlling cpu can collect stats and update
- * control parameters.
- */
- if (cpunr == control_cpu && !(count%window_size_now)) {
- should_skip =
- powerclamp_adjust_controls(target_ratio,
- guard, window_size_now);
- smp_mb();
- }
+ if (clamping == false)
+ goto out;

- if (should_skip)
- continue;
-
- target_jiffies = jiffies + duration_jiffies;
- mod_timer(&wakeup_timer, target_jiffies);
- if (unlikely(local_softirq_pending()))
- continue;
- /*
- * stop tick sched during idle time, interrupts are still
- * allowed. thus jiffies are updated properly.
- */
- preempt_disable();
- /* mwait until target jiffies is reached */
- while (time_before(jiffies, target_jiffies)) {
- unsigned long ecx = 1;
- unsigned long eax = target_mwait;
-
- /*
- * REVISIT: may call enter_idle() to notify drivers who
- * can save power during cpu idle. same for exit_idle()
- */
- local_touch_nmi();
- stop_critical_timings();
- mwait_idle_with_hints(eax, ecx);
- start_critical_timings();
- atomic_inc(&idle_wakeup_counter);
- }
- preempt_enable();
+again:
+
+ /*
+ * make sure user selected ratio does not take effect until
+ * the next round. adjust target_ratio if user has changed
+ * target such that we can converge quickly.
+ */
+ target_ratio = set_target_ratio;
+ guard = 1 + target_ratio/20;
+ window_size_now = window_size;
+
+ /*
+ * systems may have different ability to enter package level
+ * c-states, thus we need to compensate the injected idle ratio
+ * to achieve the actual target reported by the HW.
+ */
+ compensation = get_compensation(target_ratio);
+ interval = duration_jiffies*100/(target_ratio+compensation);
+
+ /* align idle time */
+ target_jiffies = roundup(jiffies, interval);
+ sleeptime = target_jiffies - jiffies;
+ if (sleeptime <= 0)
+ sleeptime = 1;
+
+ if (!(count%window_size_now)) {
+ should_skip =
+ powerclamp_adjust_controls(target_ratio,
+ guard, window_size_now);
+ smp_mb();
}
- del_timer_sync(&wakeup_timer);
- clear_bit(cpunr, cpu_clamping_mask);
+
+ if (should_skip)
+ goto again;
+
+ target_jiffies = jiffies + sleeptime + duration_jiffies;
+ mod_timer(&wakeup_timer, target_jiffies);
+ if (unlikely(local_softirq_pending()))
+ goto again;
+
+ quota = jiffies_to_usecs(sleeptime);
+ period = jiffies_to_usecs(sleeptime + duration_jiffies);
+
+out:
+ tg_set_cfs_quota(&root_task_group, quota);
+ tg_set_cfs_period(&root_task_group, period);

return 0;
}

+static void clamp_timer_fn(unsigned long foo)
+{
+ /* Evaluate to see if clamping controls need to be adjusted */
+ count++;
+ clamp_cpus();
+}
+
/*
* 1 HZ polling while clamping is active, useful for userspace
* to monitor actual idle ratio.
@@ -501,8 +467,7 @@ static void poll_pkg_cstate(struct work_struct *dummy)

static int start_power_clamp(void)
{
- unsigned long cpu;
- struct task_struct *thread;
+ clamping = true;

/* check if pkg cstate counter is completely 0, abort in this case */
if (!has_pkg_state_counter()) {
@@ -511,108 +476,21 @@ static int start_power_clamp(void)
}

set_target_ratio = clamp(set_target_ratio, 0U, MAX_TARGET_RATIO - 1);
- /* prevent cpu hotplug */
- get_online_cpus();
-
- /* prefer BSP */
- control_cpu = 0;
- if (!cpu_online(control_cpu))
- control_cpu = smp_processor_id();

- clamping = true;
schedule_delayed_work(&poll_pkg_cstate_work, 0);
-
- /* start one thread per online cpu */
- for_each_online_cpu(cpu) {
- struct task_struct **p =
- per_cpu_ptr(powerclamp_thread, cpu);
-
- thread = kthread_create_on_node(clamp_thread,
- (void *) cpu,
- cpu_to_node(cpu),
- "kidle_inject/%ld", cpu);
- /* bind to cpu here */
- if (likely(!IS_ERR(thread))) {
- kthread_bind(thread, cpu);
- wake_up_process(thread);
- *p = thread;
- }
-
- }
- put_online_cpus();
+ clamp_cpus();

return 0;
}

static void end_power_clamp(void)
{
- int i;
- struct task_struct *thread;
-
clamping = false;
- /*
- * make clamping visible to other cpus and give per cpu clamping threads
- * sometime to exit, or gets killed later.
- */
- smp_mb();
- msleep(20);
- if (bitmap_weight(cpu_clamping_mask, num_possible_cpus())) {
- for_each_set_bit(i, cpu_clamping_mask, num_possible_cpus()) {
- pr_debug("clamping thread for cpu %d alive, kill\n", i);
- thread = *per_cpu_ptr(powerclamp_thread, i);
- kthread_stop(thread);
- }
- }
-}

-static int powerclamp_cpu_callback(struct notifier_block *nfb,
- unsigned long action, void *hcpu)
-{
- unsigned long cpu = (unsigned long)hcpu;
- struct task_struct *thread;
- struct task_struct **percpu_thread =
- per_cpu_ptr(powerclamp_thread, cpu);
-
- if (false == clamping)
- goto exit_ok;
-
- switch (action) {
- case CPU_ONLINE:
- thread = kthread_create_on_node(clamp_thread,
- (void *) cpu,
- cpu_to_node(cpu),
- "kidle_inject/%lu", cpu);
- if (likely(!IS_ERR(thread))) {
- kthread_bind(thread, cpu);
- wake_up_process(thread);
- *percpu_thread = thread;
- }
- /* prefer BSP as controlling CPU */
- if (cpu == 0) {
- control_cpu = 0;
- smp_mb();
- }
- break;
- case CPU_DEAD:
- if (test_bit(cpu, cpu_clamping_mask)) {
- pr_err("cpu %lu dead but powerclamping thread is not\n",
- cpu);
- kthread_stop(*percpu_thread);
- }
- if (cpu == control_cpu) {
- control_cpu = smp_processor_id();
- smp_mb();
- }
- }
-
-exit_ok:
- return NOTIFY_OK;
+ clamp_cpus();
+ del_timer_sync(&wakeup_timer);
}

-static struct notifier_block powerclamp_cpu_notifier = {
- .notifier_call = powerclamp_cpu_callback,
-};
-
static int powerclamp_get_max_state(struct thermal_cooling_device *cdev,
unsigned long *state)
{
@@ -656,6 +534,7 @@ static int powerclamp_set_cur_state(struct thermal_cooling_device *cdev,
}

exit_set:
+ clamp_cpus();
return ret;
}

@@ -764,11 +643,6 @@ static int powerclamp_init(void)
int retval;
int bitmap_size;

- bitmap_size = BITS_TO_LONGS(num_possible_cpus()) * sizeof(long);
- cpu_clamping_mask = kzalloc(bitmap_size, GFP_KERNEL);
- if (!cpu_clamping_mask)
- return -ENOMEM;
-
/* probe cpu features and ids here */
retval = powerclamp_probe();
if (retval)
@@ -776,19 +650,12 @@ static int powerclamp_init(void)

/* set default limit, maybe adjusted during runtime based on feedback */
window_size = 2;
- register_hotcpu_notifier(&powerclamp_cpu_notifier);
-
- powerclamp_thread = alloc_percpu(struct task_struct *);
- if (!powerclamp_thread) {
- retval = -ENOMEM;
- goto exit_unregister;
- }

cooling_dev = thermal_cooling_device_register("intel_powerclamp", NULL,
&powerclamp_cooling_ops);
if (IS_ERR(cooling_dev)) {
retval = -ENODEV;
- goto exit_free_thread;
+ goto exit;
}

if (!duration)
@@ -798,23 +665,15 @@ static int powerclamp_init(void)

return 0;

-exit_free_thread:
- free_percpu(powerclamp_thread);
-exit_unregister:
- unregister_hotcpu_notifier(&powerclamp_cpu_notifier);
-exit_free:
- kfree(cpu_clamping_mask);
+exit:
return retval;
}
module_init(powerclamp_init);

static void powerclamp_exit(void)
{
- unregister_hotcpu_notifier(&powerclamp_cpu_notifier);
end_power_clamp();
- free_percpu(powerclamp_thread);
thermal_cooling_device_unregister(cooling_dev);
- kfree(cpu_clamping_mask);

cancel_delayed_work_sync(&poll_pkg_cstate_work);
debugfs_remove_recursive(debug_dir);
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 8db31ef..6a7ccb2 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -3002,6 +3002,12 @@ extern long sched_getaffinity(pid_t pid, struct cpumask *mask);

#ifdef CONFIG_CGROUP_SCHED
extern struct task_group root_task_group;
+extern int tg_set_cfs_quota(struct task_group *tg, long cfs_quota_us);
+extern int tg_set_cfs_period(struct task_group *tg, long cfs_period_us);
+#else
+
+static inline int tg_set_cfs_quota(struct task_group *tg, long cfs_quota_us) { return 0; }
+static inline int tg_set_cfs_period(struct task_group *tg, long cfs_period_us) { return 0; }
#endif /* CONFIG_CGROUP_SCHED */

extern int task_can_switch_user(struct user_struct *up,
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index e628cb1..f47621a 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -8059,9 +8059,6 @@ static int tg_set_cfs_bandwidth(struct task_group *tg, u64 period, u64 quota)
int i, ret = 0, runtime_enabled, runtime_was_enabled;
struct cfs_bandwidth *cfs_b = &tg->cfs_bandwidth;

- if (tg == &root_task_group)
- return -EINVAL;
-
/*
* Ensure we have at some amount of bandwidth every period. This is
* to prevent reaching a state of large arrears when throttled via
@@ -8141,6 +8138,7 @@ int tg_set_cfs_quota(struct task_group *tg, long cfs_quota_us)

return tg_set_cfs_bandwidth(tg, period, quota);
}
+EXPORT_SYMBOL_GPL(tg_set_cfs_quota);

long tg_get_cfs_quota(struct task_group *tg)
{
@@ -8164,6 +8162,7 @@ int tg_set_cfs_period(struct task_group *tg, long cfs_period_us)

return tg_set_cfs_bandwidth(tg, period, quota);
}
+EXPORT_SYMBOL_GPL(tg_set_cfs_period);

long tg_get_cfs_period(struct task_group *tg)
{
