Re: [PATCH v1 3/7] thermal: cpu_cooling: implement the power cooling device API

From: Javi Merino
Date: Thu Jan 29 2015 - 14:11:49 EST


On Wed, Jan 28, 2015 at 05:56:08PM +0000, Eduardo Valentin wrote:
> On Wed, Jan 28, 2015 at 05:00:34PM +0000, Javi Merino wrote:
> > Add a basic power model to the cpu cooling device to implement the
> > power cooling device API. The power model uses the current frequency,
> > current load and OPPs for the power calculations. The cpus must have
> > registered their OPPs using the OPP library.
> >
> > Cc: Zhang Rui <rui.zhang@xxxxxxxxx>
> > Cc: Eduardo Valentin <edubezval@xxxxxxxxx>
> > Signed-off-by: Punit Agrawal <punit.agrawal@xxxxxxx>
> > Signed-off-by: Javi Merino <javi.merino@xxxxxxx>
> > ---
> > Documentation/thermal/cpu-cooling-api.txt | 156 +++++++++-
> > drivers/thermal/cpu_cooling.c | 480 +++++++++++++++++++++++++++++-
> > include/linux/cpu_cooling.h | 39 +++
> > 3 files changed, 670 insertions(+), 5 deletions(-)
> >
> > diff --git a/Documentation/thermal/cpu-cooling-api.txt b/Documentation/thermal/cpu-cooling-api.txt
> > index 753e47cc2e20..71653584cd03 100644
> > --- a/Documentation/thermal/cpu-cooling-api.txt
> > +++ b/Documentation/thermal/cpu-cooling-api.txt
> > @@ -36,8 +36,162 @@ the user. The registration APIs returns the cooling device pointer.
> > np: pointer to the cooling device device tree node
> > clip_cpus: cpumask of cpus where the frequency constraints will happen.
> >
> > -1.1.3 void cpufreq_cooling_unregister(struct thermal_cooling_device *cdev)
> > +1.1.3 struct thermal_cooling_device *cpufreq_power_cooling_register(
> > + const struct cpumask *clip_cpus, u32 capacitance,
> > + get_static_t plat_static_func)
> > +
> > +Similar to cpufreq_cooling_register, this function registers a cpufreq
> > +cooling device. Using this function, the cooling device will
> > +implement the power extensions by using a simple cpu power model. The
> > +cpus must have registered their OPPs using the OPP library.
> > +
> > +The additional parameters are needed for the power model (See 2. Power
> > +models). "capacitance" is the dynamic power coefficient (See 2.1
> > +Dynamic power). "plat_static_func" is a function to calculate the
> > +static power consumed by these cpus (See 2.2 Static power).
> > +
> > +1.1.4 struct thermal_cooling_device *of_cpufreq_power_cooling_register(
> > + struct device_node *np, const struct cpumask *clip_cpus, u32 capacitance,
> > + get_static_t plat_static_func)
> > +
> > +Similar to cpufreq_power_cooling_register, this function registers a
> > +cpufreq cooling device with power extensions using the device tree
> > +information supplied by the np parameter.
> > +
> > +1.1.5 void cpufreq_cooling_unregister(struct thermal_cooling_device *cdev)
> >
> > This interface function unregisters the "thermal-cpufreq-%x" cooling device.
> >
> > cdev: Cooling device pointer which has to be unregistered.
> > +
> > +2. Power models
> > +
> > +The power API registration functions provide a simple power model for
> > +CPUs. The current power is calculated as dynamic + (optionally)
> > +static power. This power model requires that the operating-points of
> > +the CPUs are registered using the kernel's opp library and the
> > +`cpufreq_frequency_table` is assigned to the `struct device` of the
> > +cpu. If you are using CONFIG_CPUFREQ_DT then the
> > +`cpufreq_frequency_table` should already be assigned to the cpu
> > +device.
> > +
> > +The `plat_static_func` parameter of `cpufreq_power_cooling_register()`
> > +and `of_cpufreq_power_cooling_register()` is optional. If you don't
> > +provide it, only dynamic power will be considered.
> > +
> > +2.1 Dynamic power
> > +
> > +The dynamic power consumption of a processor depends on many factors.
> > +For a given processor implementation the primary factors are:
> > +
> > +- The time the processor spends running, consuming dynamic power, as
> > + compared to the time in idle states where dynamic consumption is
> > + negligible. Herein we refer to this as 'utilisation'.
> > +- The voltage and frequency levels as a result of DVFS. The DVFS
> > + level is a dominant factor governing power consumption.
> > +- In running time the 'execution' behaviour (instruction types, memory
> > + access patterns and so forth) causes, in most cases, a second order
> > + variation. In pathological cases this variation can be significant,
> > + but typically it is of a much lesser impact than the factors above.
> > +
> > +A high level dynamic power consumption model may then be represented as:
> > +
> > +Pdyn = f(run) * Voltage^2 * Frequency * Utilisation
> > +
> > +f(run) here represents the described execution behaviour and its
> > +result has units of Watts/Hz/Volt^2 (this is often expressed in
> > +mW/MHz/uVolt^2).
> > +
> > +The detailed behaviour for f(run) could be modelled on-line. However,
> > +in practice, such an on-line model has dependencies on a number of
> > +implementation specific processor support and characterisation
> > +factors. Therefore, in the initial implementation that contribution is
> > +represented as a constant coefficient. This is a simplification
> > +consistent with the relative contribution to overall power variation.
> > +
> > +In this simplified representation our model becomes:
> > +
> > +Pdyn = Capacitance * Voltage^2 * Frequency * Utilisation
> > +
> > +Where `capacitance` is a constant that represents an indicative
> > +running time dynamic power coefficient in fundamental units of
> > +mW/MHz/uVolt^2. Typical values for mobile CPUs might lie in range
> > +from 100 to 500. For reference, the approximate values for the SoC in
> > +ARM's Juno Development Platform are 530 for the Cortex-A57 cluster and
> > +140 for the Cortex-A53 cluster.
> > +
> > +
> > +2.2 Static power
> > +
> > +Static leakage power consumption depends on a number of factors. For a
> > +given circuit implementation the primary factors are:
> > +
> > +- Time the circuit spends in each 'power state'
> > +- Temperature
> > +- Operating voltage
> > +- Process grade
> > +
> > +The time the circuit spends in each 'power state' for a given
> > +evaluation period at first order means OFF or ON. However,
> > +'retention' states can also be supported that reduce power during
> > +inactive periods without loss of context.
> > +
> > +Note: The visibility of state entries to the OS can vary, according to
> > +platform specifics, and this can then impact the accuracy of a model
> > +based on OS state information alone. It might be possible in some
> > +cases to extract more accurate information from system resources.
> > +
> > +The temperature, operating voltage and process 'grade' (slow to fast)
> > +of the circuit are all significant factors in static leakage power
> > +consumption. All of these have complex relationships to static power.
> > +
> > +Circuit implementation specific factors include the chosen silicon
> > +process as well as the type, number and size of transistors in both
> > +the logic gates and any RAM elements included.
> > +
> > +The static power consumption modelling must take into account the
> > +power managed regions that are implemented. Taking the example of an
> > +ARM processor cluster, the modelling would take into account whether
> > +each CPU can be powered OFF separately or if only a single power
> > +region is implemented for the complete cluster.
> > +
> > +In one view (there are others), a static power consumption model can
> > +then start from a set of reference values for each power managed
> > +region (e.g. CPU, Cluster/L2) in each state (e.g. ON, OFF) at an
> > +arbitrary process grade, voltage and temperature point. These values
> > +are then scaled for all of the following: the time in each state, the
> > +process grade, the current temperature and the operating voltage.
> > +However, since both implementation specific and complex relationships
> > +dominate the estimate, the appropriate interface to the model from the
> > +cpu cooling device is to provide a function callback that calculates
> > +the static power in this platform. When registering the cpu cooling
> > +device pass a function pointer that follows the `get_static_t`
> > +prototype:
> > +
> > + int plat_get_static(cpumask_t *cpumask, int interval,
> > + unsigned long voltage, u32 *power);
> > +
> > +`cpumask` is the cpumask of the cpus involved in the calculation.
> > +`voltage` is the voltage at which they are operating. The function
> > +should calculate the average static power for the last `interval`
> > +milliseconds. It returns 0 on success, -E* on error. If it
> > +succeeds, it should store the static power in `power`. Reading the
> > +temperature of the cpus described by `cpumask` is left for
> > +plat_get_static() to do as the platform knows best which thermal
> > +sensor is closest to the cpu.
> > +
> > +If `plat_static_func` is NULL, static power is considered to be
> > +negligible for this platform and only dynamic power is considered.
> > +
> > +The platform specific callback can then use any combination of tables
> > +and/or equations to compute the estimated value. Process grade
> > +information is not passed to the model since access to such data, from
> > +on-chip measurement capability or manufacture time data, is platform
> > +specific.
> > +
> > +Note: the significance of static power for CPUs in comparison to
> > +dynamic power is highly dependent on implementation. Given the
> > +potential complexity in implementation, the importance and accuracy of
> > +its inclusion when using cpu cooling devices should be assessed on a
> > +case by case basis.
> > +
> > diff --git a/drivers/thermal/cpu_cooling.c b/drivers/thermal/cpu_cooling.c
> > index f65f0d109fc8..a639aaf228f5 100644
> > --- a/drivers/thermal/cpu_cooling.c
> > +++ b/drivers/thermal/cpu_cooling.c
> > @@ -26,6 +26,7 @@
> > #include <linux/thermal.h>
> > #include <linux/cpufreq.h>
> > #include <linux/err.h>
> > +#include <linux/pm_opp.h>
> > #include <linux/slab.h>
> > #include <linux/cpu.h>
> > #include <linux/cpu_cooling.h>
> > @@ -45,6 +46,19 @@
> > */
> >
> > /**
> > + * struct power_table - frequency to power conversion
> > + * @frequency: frequency in KHz
> > + * @power: power in mW
> > + *
> > + * This structure is built when the cooling device registers and helps
> > + * in translating frequency to power and vice versa.
> > + */
> > +struct power_table {
> > + u32 frequency;
> > + u32 power;
> > +};
> > +
> > +/**
> > * struct cpufreq_cooling_device - data for cooling device with cpufreq
> > * @id: unique integer value corresponding to each cpufreq_cooling_device
> > * registered.
> > @@ -58,6 +72,15 @@
> > * cpufreq frequencies.
> > * @allowed_cpus: all the cpus involved for this cpufreq_cooling_device.
> > * @node: list_head to link all cpufreq_cooling_device together.
> > + * @last_load: load measured by the latest call to cpufreq_get_actual_power()
> > + * @time_in_idle: previous reading of the absolute time that this cpu was idle
> > + * @time_in_idle_timestamp: wall time of the last invocation of
> > + * get_cpu_idle_time()
> > + * @dyn_power_table: array of struct power_table for frequency to power
> > + * conversion, sorted in ascending order.
> > + * @dyn_power_table_entries: number of entries in the @dyn_power_table array
> > + * @cpu_dev: the first cpu_device from @allowed_cpus that has OPPs registered
> > + * @plat_get_static_power: callback to calculate the static power
> > *
> > * This structure is required for keeping information of each registered
> > * cpufreq_cooling_device.
> > @@ -71,6 +94,13 @@ struct cpufreq_cooling_device {
> > unsigned int *freq_table; /* In descending order */
> > struct cpumask allowed_cpus;
> > struct list_head node;
> > + u32 last_load;
> > + u64 time_in_idle[NR_CPUS];
> > + u64 time_in_idle_timestamp[NR_CPUS];
> > + struct power_table *dyn_power_table;
> > + int dyn_power_table_entries;
> > + struct device *cpu_dev;
> > + get_static_t plat_get_static_power;
> > };
> > static DEFINE_IDR(cpufreq_idr);
> > static DEFINE_MUTEX(cooling_cpufreq_lock);
> > @@ -205,6 +235,210 @@ static int cpufreq_thermal_notifier(struct notifier_block *nb,
> > return 0;
> > }
> >
> > +/**
> > + * build_dyn_power_table() - create a dynamic power to frequency table
> > + * @cpufreq_device: the cpufreq cooling device in which to store the table
> > + * @capacitance: dynamic power coefficient for these cpus
> > + *
> > + * Build a dynamic power to frequency table for this cpu and store it
> > + * in @cpufreq_device. This table will be used in cpu_power_to_freq() and
> > + * cpu_freq_to_power() to convert between power and frequency
> > + * efficiently. Power is stored in mW, frequency in KHz. The
> > + * resulting table is in ascending order.
> > + *
> > + * Return: 0 on success, -E* on error.
> > + */
> > +static int build_dyn_power_table(struct cpufreq_cooling_device *cpufreq_device,
> > + u32 capacitance)
> > +{
> > + struct power_table *power_table;
> > + struct dev_pm_opp *opp;
> > + struct device *dev = NULL;
> > + int num_opps = 0, cpu, i, ret = 0;
> > + unsigned long freq;
> > +
> > + rcu_read_lock();
> > +
> > + for_each_cpu(cpu, &cpufreq_device->allowed_cpus) {
> > + dev = get_cpu_device(cpu);
> > + if (!dev) {
> > + dev_warn(&cpufreq_device->cool_dev->device,
> > + "No cpu device for cpu %d\n", cpu);
> > + continue;
> > + }
> > +
> > + num_opps = dev_pm_opp_get_opp_count(dev);
> > + if (num_opps > 0) {
> > + break;
> > + } else if (num_opps < 0) {
> > + ret = num_opps;
> > + goto unlock;
> > + }
> > + }
> > +
> > + if (num_opps == 0) {
> > + ret = -EINVAL;
> > + goto unlock;
> > + }
> > +
> > + power_table = kcalloc(num_opps, sizeof(*power_table), GFP_KERNEL);
> > +
> > + for (freq = 0, i = 0;
> > + opp = dev_pm_opp_find_freq_ceil(dev, &freq), !IS_ERR(opp);
> > + freq++, i++) {
> > + u32 freq_mhz, voltage_mv;
> > + u64 power;
> > +
> > + freq_mhz = freq / 1000000;
> > + voltage_mv = dev_pm_opp_get_voltage(opp) / 1000;
> > +
> > + /*
> > + * Do the multiplication with MHz and millivolt so as
> > + * to not overflow.
> > + */
> > + power = (u64)capacitance * freq_mhz * voltage_mv * voltage_mv;
> > + do_div(power, 1000000000);
> > +
> > + /* frequency is stored in power_table in KHz */
> > + power_table[i].frequency = freq / 1000;
> > +
> > + /* power is stored in mW */
> > + power_table[i].power = power;
> > + }
> > +
> > + if (i == 0) {
> > + ret = PTR_ERR(opp);
> > + goto unlock;
> > + }
> > +
> > + cpufreq_device->cpu_dev = dev;
> > + cpufreq_device->dyn_power_table = power_table;
> > + cpufreq_device->dyn_power_table_entries = i;
> > +
> > +unlock:
> > + rcu_read_unlock();
> > + return ret;
> > +}
> > +
> > +static u32 cpu_freq_to_power(struct cpufreq_cooling_device *cpufreq_device,
> > + u32 freq)
> > +{
> > + int i;
> > + struct power_table *pt = cpufreq_device->dyn_power_table;
> > +
> > + for (i = 1; i < cpufreq_device->dyn_power_table_entries; i++)
> > + if (freq < pt[i].frequency)
> > + break;
> > +
> > + return pt[i - 1].power;
> > +}
> > +
> > +static u32 cpu_power_to_freq(struct cpufreq_cooling_device *cpufreq_device,
> > + u32 power)
> > +{
> > + int i;
> > + struct power_table *pt = cpufreq_device->dyn_power_table;
> > +
> > + for (i = 1; i < cpufreq_device->dyn_power_table_entries; i++)
> > + if (power < pt[i].power)
> > + break;
> > +
> > + return pt[i - 1].frequency;
> > +}
> > +
> > +/**
> > + * get_load() - get load for a cpu since last updated
> > + * @cpufreq_device: &struct cpufreq_cooling_device for this cpu
> > + * @cpu: cpu number
> > + *
> > + * Return: The average load of cpu @cpu in percentage since this
> > + * function was last called.
> > + */
> > +static u32 get_load(struct cpufreq_cooling_device *cpufreq_device, int cpu)
> > +{
> > + u32 load;
> > + u64 now, now_idle, delta_time, delta_idle;
> > +
> > + now_idle = get_cpu_idle_time(cpu, &now, 0);
> > + delta_idle = now_idle - cpufreq_device->time_in_idle[cpu];
> > + delta_time = now - cpufreq_device->time_in_idle_timestamp[cpu];
> > +
> > + if (delta_time <= delta_idle)
> > + load = 0;
> > + else
> > + load = div64_u64(100 * (delta_time - delta_idle), delta_time);
> > +
> > + cpufreq_device->time_in_idle[cpu] = now_idle;
> > + cpufreq_device->time_in_idle_timestamp[cpu] = now;
> > +
> > + return load;
> > +}
> > +
> > +/**
> > + * get_static_power() - calculate the static power consumed by the cpus
> > + * @cpufreq_device: struct &cpufreq_cooling_device for this cpu cdev
> > + * @tz: thermal zone device in which we're operating
> > + * @freq: frequency in KHz
> > + * @power: pointer in which to store the calculated static power
> > + *
> > + * Calculate the static power consumed by the cpus described by
> > + * @cpufreq_device running at frequency @freq. This function relies on a
> > + * platform specific function that should have been provided when the
> > + * cooling device was registered. If it wasn't, the static power is assumed to
> > + * be negligible. The calculated static power is stored in @power.
> > + *
> > + * Return: 0 on success, -E* on failure.
> > + */
> > +static int get_static_power(struct cpufreq_cooling_device *cpufreq_device,
> > + struct thermal_zone_device *tz, unsigned long freq,
> > + u32 *power)
> > +{
> > + struct dev_pm_opp *opp;
> > + unsigned long voltage;
> > + struct cpumask *cpumask = &cpufreq_device->allowed_cpus;
> > + unsigned long freq_hz = freq * 1000;
> > +
> > + if (!cpufreq_device->plat_get_static_power) {
> > + *power = 0;
> > + return 0;
> > + }
> > +
> > + rcu_read_lock();
> > +
> > + opp = dev_pm_opp_find_freq_exact(cpufreq_device->cpu_dev, freq_hz,
> > + true);
> > + voltage = dev_pm_opp_get_voltage(opp);
> > +
> > + rcu_read_unlock();
> > +
> > + if (voltage == 0) {
> > + dev_warn_ratelimited(cpufreq_device->cpu_dev,
> > + "Failed to get voltage for frequency %lu: %ld\n",
> > + freq_hz, IS_ERR(opp) ? PTR_ERR(opp) : 0);
> > + return -EINVAL;
> > + }
> > +
> > + return cpufreq_device->plat_get_static_power(cpumask, tz->passive_delay,
> > + voltage, power);
> > +}
> > +
> > +/**
> > + * get_dynamic_power() - calculate the dynamic power
> > + * @cpufreq_device: &cpufreq_cooling_device for this cdev
> > + * @freq: current frequency
> > + *
> > + * Return: the dynamic power consumed by the cpus described by
> > + * @cpufreq_device.
> > + */
> > +static u32 get_dynamic_power(struct cpufreq_cooling_device *cpufreq_device,
> > + unsigned long freq)
> > +{
> > + u32 raw_cpu_power;
> > +
> > + raw_cpu_power = cpu_freq_to_power(cpufreq_device, freq);
> > + return (raw_cpu_power * cpufreq_device->last_load) / 100;
> > +}
> > +
> > /* cpufreq cooling device callback functions are defined below */
> >
> > /**
> > @@ -280,8 +514,161 @@ static int cpufreq_set_cur_state(struct thermal_cooling_device *cdev,
> > return 0;
> > }
> >
> > +/**
> > + * cpufreq_get_requested_power() - get the current power
> > + * @cdev: &thermal_cooling_device pointer
> > + * @tz: a valid thermal zone device pointer
> > + * @power: pointer in which to store the resulting power
> > + *
> > + * Calculate the current power consumption of the cpus in milliwatts
> > + * and store it in @power. This function should actually calculate
> > + * the requested power, but it's hard to get the frequency that
> > + * cpufreq would have assigned if there were no thermal limits.
> > + * Instead, we calculate the current power on the assumption that the
> > + * immediate future will look like the immediate past.
> > + *
> > + * Return: 0 on success, -E* if getting the static power failed.
> > + */
> > +static int cpufreq_get_requested_power(struct thermal_cooling_device *cdev,
> > + struct thermal_zone_device *tz,
> > + u32 *power)
> > +{
> > + unsigned long freq;
> > + int cpu, ret;
> > + u32 static_power, dynamic_power, total_load = 0;
> > + struct cpufreq_cooling_device *cpufreq_device = cdev->devdata;
> > +
> > + freq = cpufreq_quick_get(cpumask_any(&cpufreq_device->allowed_cpus));
> > +
> > + for_each_cpu(cpu, &cpufreq_device->allowed_cpus) {
> > + u32 load;
> > +
> > + if (cpu_online(cpu))
> > + load = get_load(cpufreq_device, cpu);
> > + else
> > + load = 0;
> > +
> > + total_load += load;
> > + }
> > +
> > + cpufreq_device->last_load = total_load;
> > +
> > + dynamic_power = get_dynamic_power(cpufreq_device, freq);
> > + ret = get_static_power(cpufreq_device, tz, freq, &static_power);
> > + if (ret)
> > + return ret;
> > +
> > + *power = static_power + dynamic_power;
> > + return 0;
> > +}
>
> Repeating the query I've just made on v5, do we care if the system uses
> different opps during the load sampling interval?
>
> Meaning, 1 - idle might not reflect the correct load.

Similarly to what we discussed in the other thread, we know it's a
simplification and we haven't seen it affecting performance. We will
add a comment that clarifies this in the code.
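
To make the simplification concrete, here is a minimal user-space sketch
(not the kernel code). It mirrors the arithmetic of get_load(),
cpu_freq_to_power() and get_dynamic_power() from the patch. The names
power_entry, window_load, freq_to_power and estimate_dynamic_power are
hypothetical and only used for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical user-space stand-in for struct power_table in the patch. */
struct power_entry {
	uint32_t freq_khz;
	uint32_t power_mw;
};

/*
 * Load in percent over one sampling window, i.e. 100 * (1 - idle / total).
 * Only the time split is used; the frequencies seen during the window
 * are not taken into account.
 */
static uint32_t window_load(uint64_t delta_time, uint64_t delta_idle)
{
	if (delta_time <= delta_idle)
		return 0;

	return (uint32_t)(100 * (delta_time - delta_idle) / delta_time);
}

/*
 * Same lookup as cpu_freq_to_power(): the table is sorted by ascending
 * frequency and we return the power of the highest entry whose
 * frequency does not exceed freq_khz (or of the first entry).
 */
static uint32_t freq_to_power(const struct power_entry *pt, size_t n,
			      uint32_t freq_khz)
{
	size_t i;

	assert(n > 0);

	for (i = 1; i < n; i++)
		if (freq_khz < pt[i].freq_khz)
			break;

	return pt[i - 1].power_mw;
}

/* Dynamic power estimate: power at the sampled frequency scaled by load. */
static uint32_t estimate_dynamic_power(const struct power_entry *pt, size_t n,
				       uint32_t freq_khz, uint32_t load)
{
	return freq_to_power(pt, n, freq_khz) * load / 100;
}
```

With a made-up table of 100 mW at 500 MHz and 400 mW at 1000 MHz, a cpu
that was busy for half of the window and is sampled at 1000 MHz gets an
estimate of 200 mW, even if it spent its busy time at 500 MHz (which
would give 50 mW). That is the inaccuracy we are accepting.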

Cheers,
Javi