RE: [PATCH] [RFC] ondemand governor: dynamic cpufreq scaling with different CPUs

From: Yung, Winson W
Date: Sun Jul 31 2011 - 21:08:26 EST


Thanks Venki for the comments; here are my thoughts on the change:

Correct me if I am wrong: on PCs it is probably true that hw (BIOS) coordination mode for P-states is better than sw coordination mode. However, on smartphone/tablet devices there is no BIOS, so sw coordination mode is the only choice.

I am not sure I can simplify the change inside do_dbs_timer per your comment: dbs_info is per-CPU, so how are the other CPUs supposed to see the cpu change without getting that CPU's dbs_info? That's why I store the cpu and dbs_timer interval info inside the policy, which is shared by all CPUs.
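
To make this concrete, the gate at the top of do_dbs_timer boils down to roughly the following (condensed from the patch below, so please read it as a sketch rather than the exact code):

    /* last_do_dbs_time and last_delay live in the shared cpufreq_policy,
     * so every CPU in policy->cpus sees the same values.
     */
    u64 now = get_jiffies_64();
    u64 last = atomic64_read(&dbs_info->cur_policy->last_do_dbs_time);

    /* Another CPU sampled recently enough; nothing to do on this CPU. */
    if ((now - last) < dbs_info->cur_policy->last_delay)
            goto do_exit;

    /* Try to claim this sampling slot; if the cmpxchg lost the race,
     * another CPU is already doing the governing for this interval.
     */
    if (atomic64_cmpxchg(&dbs_info->cur_policy->last_do_dbs_time,
                         last, now) != last)
            goto do_exit;

    /* This CPU now owns the shared policy and does the load evaluation. */
    dbs_info->cur_policy->cpu = dbs_info->cpu;

Whichever CPU wins the cmpxchg does the sampling for that interval; the losers just re-arm their own delayed work and back off.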

Having said that, handling CPU hotplug is something my patch does not do. If I offline one CPU and then online it again, it will no longer participate in the do_dbs_timer call. The code to add it back to the dbs_timer CPU list and use the same shared policy is somewhat complicated. I am hoping to get some better ideas on this from the community.

BTW, this delayed work queue problem should exist in both ondemand and conservative governors.

/Winson

>-----Original Message-----
>From: Venkatesh Pallipadi [mailto:venki@xxxxxxxxxx]
>Sent: Friday, July 29, 2011 9:53 AM
>To: Yung, Winson W
>Cc: linux-kernel@xxxxxxxxxxxxxxx; Van De Ven, Arjan
>Subject: Re: [PATCH] [RFC] ondemand governor: dynamic cpufreq scaling with different CPUs
>
>On Tue, Jul 26, 2011 at 10:22 AM, Yung, Winson W
><winson.w.yung@xxxxxxxxx> wrote:
>>
>> While using the delayed work queue inside the ondemand governor, I
>> came to realize that there are certain scenarios where the governor
>> may not re-adjust CPU frequency swiftly for some duration. Here is a
>> brief description of the problem:
>>
>> Currently when the ondemand governor starts, it will create kthreads
>> for each CPU, something like kondemand/0, kondemand/1 and so on.
>> I found from ftrace that kondemand/0 is the only kthread that
>> handles all the dynamic CPU frequency scaling; other kthreads such
>> as kondemand/1 are never used. I also confirmed this by debugging
>> through the kernel code.
>
>This is not universally true. You are seeing this because the platform
>is using SOFTWARE_COORDINATION mode for P-states. HARDWARE_COORDINATION
>mode is the most optimal in terms of overhead, and it used to be the
>recommended mode for platforms. HW coordination is (or at least used to
>be) the more common mode across platforms.
>Not sure whether this platform has SW coordination as a feature or a bug?
>
>>
>> So if no timer fires on CPU-0 (i.e. CPU-0 is in C6), no deferrable
>> timer will ever be fired by kondemand/0 to sample the workload of all
>> CPUs. Since it is the only kthread doing the CPU frequency scaling
>> for all CPUs, it is possible to enter a situation where CPU frequency
>> scaling will stop working for some duration until CPU-0 becomes
>> non-idle again.
>
>Yes. This will be a problem.
>But I am not sure whether the added complexity of the change below
>is worth it to support the sub-optimal mode here. But if we have a huge
>number of platforms that are using software coordination, then there may
>not be much of a choice here :(
>
>>
>> Signed-off-by: Hari K Kanigeri <hari.kkanigeri@xxxxxxxxx>
>> ---
>>  drivers/cpufreq/cpufreq_ondemand.c |   93 +++++++++++++++++++++++++++++++----
>>  include/linux/cpufreq.h            |    3 +
>>  2 files changed, 85 insertions(+), 11 deletions(-)
>>
>> diff --git a/drivers/cpufreq/cpufreq_ondemand.c b/drivers/cpufreq/cpufreq_ondemand.c
>> index 891360e..b4f4f9d 100644
>> --- a/drivers/cpufreq/cpufreq_ondemand.c
>> +++ b/drivers/cpufreq/cpufreq_ondemand.c
>> @@ -541,6 +541,9 @@ static void dbs_check_cpu(struct cpu_dbs_info_s *this_dbs_info)
>>
>>  static void do_dbs_timer(struct work_struct *work)
>>  {
>> +       u64 old_time, new_do_dbs_time;
>> +       u64 saved_last_time;
>> +
>>        struct cpu_dbs_info_s *dbs_info =
>>                container_of(work, struct cpu_dbs_info_s, work.work);
>>        unsigned int cpu = dbs_info->cpu;
>> @@ -548,6 +551,41 @@ static void do_dbs_timer(struct work_struct *work)
>>
>>        int delay;
>>
>> +       new_do_dbs_time = get_jiffies_64();
>> +
>> +       /* Save a copy of the policy's last_do_dbs_time */
>> +
>> +       saved_last_time = atomic64_read(
>> +               &dbs_info->cur_policy->last_do_dbs_time);
>> +
>> +       /* If the last time do_dbs_timer ran is less than */
>> +       /* 'delay' delta from current time, one of the    */
>> +       /* other CPUs is still handling cpufreq governing */
>> +       /* for all CPUs. */
>> +
>> +       if ((new_do_dbs_time - saved_last_time) <
>> +               dbs_info->cur_policy->last_delay) {
>> +
>> +               /* If do_dbs_timer ran not long ago */
>> +               /* (delta is less than last_delay)  */
>> +               /* then bail out because there is   */
>> +               /* no need to run so frequently.    */
>> +               goto do_exit;
>> +       }
>> +
>> +       old_time = atomic64_cmpxchg(&dbs_info->cur_policy->last_do_dbs_time,
>> +                                       saved_last_time, new_do_dbs_time);
>> +
>> +       /* If old_time is not equal to saved_last_time,  */
>> +       /* it means another CPU is calling do_dbs_timer. */
>> +
>> +       if (old_time != saved_last_time)
>> +               goto do_exit;
>> +
>> +       /* Switch cpu saved in the policy */
>> +       if (dbs_info->cur_policy->cpu != cpu)
>> +               dbs_info->cur_policy->cpu = cpu;
>> +
>
>
>This seems like a somewhat complicated way to do this. One simpler way
>could be to have a new atomic element in dbs_info that stores the
>current CPU among policy->cpus doing dbs_timers. By default it will be
>policy->cpu, and when it goes idle it changes the variable to -1, so
>that any other CPU can nominate itself (using atomic cmpxchg). If it
>is still -1, then policy->cpu does the check anyway.
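
If I read this suggestion right, it would look roughly like the untested sketch below (owner_cpu would be a new, hypothetical atomic_t in the dbs_info of policy->cpu, and the helper name is made up):

    /* Hypothetical sketch of the nomination idea described above. The
     * owner defaults to policy->cpu; when it goes idle it stores -1 so
     * that any other CPU in policy->cpus can claim the sampling work.
     */
    static bool dbs_try_claim_owner(struct cpu_dbs_info_s *owner_info, int cpu)
    {
            /* Fast path: this CPU already owns the sampling work. */
            if (atomic_read(&owner_info->owner_cpu) == cpu)
                    return true;

            /* Previous owner went idle (-1); try to nominate ourselves. */
            return atomic_cmpxchg(&owner_info->owner_cpu, -1, cpu) == -1;
    }

Note that even then every CPU would have to reach into policy->cpu's dbs_info to read and update owner_cpu, which is the part I was trying to avoid by keeping the state in the shared policy.
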
>
>>        mutex_lock(&dbs_info->timer_mutex);
>>
>>        /* Common NORMAL_SAMPLE setup */
>> @@ -559,6 +597,7 @@ static void do_dbs_timer(struct work_struct *work)
>>                        /* Setup timer for SUB_SAMPLE */
>>                        dbs_info->sample_type = DBS_SUB_SAMPLE;
>>                        delay = dbs_info->freq_hi_jiffies;
>> +                       dbs_info->cur_policy->last_delay = delay;
>>                } else {
>>                        /* We want all CPUs to do sampling nearly on
>>                         * same jiffy
>> @@ -566,14 +605,18 @@ static void do_dbs_timer(struct work_struct *work)
>>                        delay = usecs_to_jiffies(dbs_tuners_ins.sampling_rate
>>                                * dbs_info->rate_mult);
>>
>> -                       if (num_online_cpus() > 1)
>> +                       if (num_online_cpus() > 1) {
>>                                delay -= jiffies % delay;
>> +                               dbs_info->cur_policy->last_delay = delay;
>> +                       }
>>                }
>>        } else {
>>                __cpufreq_driver_target(dbs_info->cur_policy,
>>                        dbs_info->freq_lo, CPUFREQ_RELATION_H);
>>                delay = dbs_info->freq_lo_jiffies;
>> +               dbs_info->cur_policy->last_delay = delay;
>>        }
>> +do_exit:
>>        schedule_delayed_work_on(cpu, &dbs_info->work, delay);
>>        mutex_unlock(&dbs_info->timer_mutex);
>>  }
>> @@ -648,10 +691,12 @@ static int cpufreq_governor_dbs(struct cpufreq_policy *policy,
>>                                j_dbs_info->prev_cpu_nice =
>>                                                kstat_cpu(j).cpustat.nice;
>>                        }
>> +
>> +                       j_dbs_info->cpu = j;
>> +                       j_dbs_info->rate_mult = 1;
>> +                       ondemand_powersave_bias_init_cpu(j);
>>                }
>> -               this_dbs_info->cpu = cpu;
>> -               this_dbs_info->rate_mult = 1;
>> -               ondemand_powersave_bias_init_cpu(cpu);
>> +
>>                /*
>>                 * Start the timerschedule work, when this governor
>>                 * is used for first time
>> @@ -680,32 +725,58 @@ static int cpufreq_governor_dbs(struct cpufreq_policy *policy,
>>                }
>>                mutex_unlock(&dbs_mutex);
>>
>> -               mutex_init(&this_dbs_info->timer_mutex);
>> -               dbs_timer_init(this_dbs_info);
>> +               for_each_cpu(j, policy->cpus) {
>> +                       struct cpu_dbs_info_s *j_dbs_info;
>> +                       j_dbs_info = &per_cpu(od_cpu_dbs_info, j);
>> +                       mutex_init(&j_dbs_info->timer_mutex);
>> +                       dbs_timer_init(j_dbs_info);
>> +               }
>>                break;
>>
>>        case CPUFREQ_GOV_STOP:
>> -               dbs_timer_exit(this_dbs_info);
>> +               for_each_cpu(j, policy->cpus) {
>> +                       struct cpu_dbs_info_s *j_dbs_info;
>> +                       j_dbs_info = &per_cpu(od_cpu_dbs_info, j);
>> +                       dbs_timer_exit(j_dbs_info);
>> +               }
>>
>>                mutex_lock(&dbs_mutex);
>> -               mutex_destroy(&this_dbs_info->timer_mutex);
>> +
>> +               for_each_cpu(j, policy->cpus) {
>> +                       struct cpu_dbs_info_s *j_dbs_info;
>> +                       j_dbs_info = &per_cpu(od_cpu_dbs_info, j);
>> +                       mutex_destroy(&j_dbs_info->timer_mutex);
>> +               }
>> +
>>                dbs_enable--;
>>                mutex_unlock(&dbs_mutex);
>>                if (!dbs_enable)
>>                        sysfs_remove_group(cpufreq_global_kobject,
>>                                           &dbs_attr_group);
>> -
>>                break;
>>
>>        case CPUFREQ_GOV_LIMITS:
>> -               mutex_lock(&this_dbs_info->timer_mutex);
>> +               /* Since ondemand governor is no longer running on the */
>> +               /* same CPU anymore, need to mutex_lock all timer_mutex */
>> +               /* before calling __cpufreq_driver_target. */
>> +               for_each_cpu(j, policy->cpus) {
>> +                       struct cpu_dbs_info_s *j_dbs_info;
>> +                       j_dbs_info = &per_cpu(od_cpu_dbs_info, j);
>> +                       mutex_lock(&j_dbs_info->timer_mutex);
>
>Not sure why we need to lock all policy->cpus' mutexes here. All
>policy->cpus could as well use policy->cpu's mutex. No?
>
>
>> +               }
>> +
>>                if (policy->max < this_dbs_info->cur_policy->cur)
>>                        __cpufreq_driver_target(this_dbs_info->cur_policy,
>>                                policy->max, CPUFREQ_RELATION_H);
>>                else if (policy->min > this_dbs_info->cur_policy->cur)
>>                        __cpufreq_driver_target(this_dbs_info->cur_policy,
>>                                policy->min, CPUFREQ_RELATION_L);
>> -               mutex_unlock(&this_dbs_info->timer_mutex);
>> +
>> +               for_each_cpu(j, policy->cpus) {
>> +                       struct cpu_dbs_info_s *j_dbs_info;
>> +                       j_dbs_info = &per_cpu(od_cpu_dbs_info, j);
>> +                       mutex_unlock(&j_dbs_info->timer_mutex);
>> +               }
>>                break;
>>        }
>>        return 0;
>> diff --git a/include/linux/cpufreq.h b/include/linux/cpufreq.h
>> index 11be48e..2295a35 100644
>> --- a/include/linux/cpufreq.h
>> +++ b/include/linux/cpufreq.h
>> @@ -106,6 +106,9 @@ struct cpufreq_policy {
>>
>>        struct kobject          kobj;
>>        struct completion       kobj_unregister;
>> +
>> +       atomic64_t              last_do_dbs_time;
>> +       int                     last_delay;
>>  };
>>
>>  #define CPUFREQ_ADJUST         (0)
>> --
>> 1.7.1
>>