Re: [RFCv5 PATCH 43/46] sched/{fair,cpufreq_sched}: add reset_capacity interface

From: Steve Muckle
Date: Mon Oct 12 2015 - 15:02:18 EST


On 10/09/2015 02:14 AM, Juri Lelli wrote:
>> Though I understand the initial stated motivation here (avoiding a
>> redundant capacity request upon idle entry), releasing the CPU's
>> capacity request altogether on idle seems like it could be a contentious
>> policy decision.
>>
>> An example to illustrate my concern:
>> - 2 CPU single frequency domain topology
>> - task A is a small frequently-running task on CPU0
>> - task B is a heavier intermittent task running on CPU1
>>
>> Task B is driving the frequency of the cluster high, but whenever it
>> sleeps CPU1 becomes idle and the capacity request is dropped. If there's
>> any activity on CPU0 that causes cpufreq_sched_set_cap() to be called
>> (which is likely, given task A runs often) the cluster frequency will be
>> lowered. Task B's performance will be impacted when it wakes up because
>> initially the OPP will be insufficient. Power may or may not be
>
> With the current implementation you are right: B's util will be decayed
> and it will have to build it up again, losing performance. What
> if we try to change this as discussed at Connect? At enqueue time we
> use pre-decayed B's util, so that it will generate an OPP transition
> at the required capacity on wakeup.

Actually I wasn't even really considering the decay of B's utilization -
just that the CPU OPP will have been lowered due to the reset of CPU1's
reservation when B slept and subsequent task activity on CPU0, and then
will have to be raised (to something, depending on whether pre- or
post-decayed utilization is used) when B wakes. The latency of OPP
transitions may be considerable, or at least nontrivial, compared to a
task's wake/sleep pattern, meaning that a good portion of the task
activity may occur while the OPP is suboptimal for that task. Frequent
OPP transitions may also have a nontrivial overhead in terms of CPU
usage and energy.
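
To make the scenario concrete, here is a toy model in Python (not kernel
code). It assumes the frequency domain simply serves the largest per-CPU
request, and that a dropped request only takes effect at the next
set_cap() call on some CPU in the domain. The names Cluster, set_cap and
reset_cap and the capacity values 200 and 800 are made up for
illustration and are not taken from the patch.

```python
class Cluster:
    """Toy model of one frequency domain shared by several CPUs."""

    def __init__(self, n_cpus):
        self.requests = [0] * n_cpus   # per-CPU capacity requests (votes)
        self.capacity = 0              # capacity the domain currently provides
        self.transitions = 0           # number of OPP changes so far

    def _update(self):
        # Assumption: the domain must satisfy its most demanding CPU.
        target = max(self.requests)
        if target != self.capacity:
            self.capacity = target
            self.transitions += 1

    def set_cap(self, cpu, cap):
        self.requests[cpu] = cap
        self._update()

    def reset_cap(self, cpu):
        # Hypothetical stand-in for dropping the vote on idle entry.
        # The new aggregate only takes effect at the next set_cap() call.
        self.requests[cpu] = 0


cluster = Cluster(2)
cluster.set_cap(0, 200)   # task A, small, on CPU0
cluster.set_cap(1, 800)   # task B, heavy, on CPU1
high = cluster.capacity

cluster.reset_cap(1)      # B sleeps, CPU1 goes idle, vote dropped
cluster.set_cap(0, 200)   # A runs again and triggers a re-evaluation
low = cluster.capacity

cluster.set_cap(1, 800)   # B wakes and has to raise the OPP again
print(high, low, cluster.capacity, cluster.transitions)
```

In this model every sleep/wake cycle of B costs two OPP transitions (one
down, one up), and B starts each burst at A's capacity until the second
transition completes.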

I don't have an opinion to offer at the moment on using the pre- or
post-decayed utilization in enqueue. That seems like a tough policy choice
which may require a lot of power/perf data to clearly justify either
way. My concern here is limited to whether a CPU's DVFS
contribution/vote should be entirely removed when the last task on it is
dequeued, or removed gradually (decayed) over time, or removed entirely
after some timeout, etc.
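
A rough Python sketch of those three options, only to show how
differently they treat an idle CPU's vote. The function idle_vote, the
policy names, and the 32 ms half-life and 80 ms timeout are all
assumptions chosen for illustration; none of them come from the patch
set or from any existing governor.

```python
def idle_vote(policy, cap, idle_ms, half_life_ms=32, timeout_ms=80):
    """Capacity an idle CPU still requests, idle_ms after its last dequeue."""
    if policy == "immediate":
        # Vote removed as soon as the last task is dequeued.
        return 0
    if policy == "decay":
        # Vote halves every half_life_ms (illustrative value).
        return cap * 0.5 ** (idle_ms / half_life_ms)
    if policy == "timeout":
        # Vote held in full, then dropped after timeout_ms (illustrative value).
        return cap if idle_ms < timeout_ms else 0
    raise ValueError(policy)


for policy in ("immediate", "decay", "timeout"):
    print(policy, [round(idle_vote(policy, 800, t)) for t in (0, 32, 64, 96)])
```

Which of these curves is preferable depends on the task's wake/sleep
pattern and the platform's OPP transition cost, which is why it looks
like a policy question to me.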

>> The decision of when a CPU's vote should be decayed or removed is more
>> a matter of policy, where I believe there's no single right answer, and
>> in the past it has been solved with tunables. The interactive governor's
>> slack timer controls how long it will allow an idle CPU to request a
>> frequency > fmin.
>>
>
> Mmm, IMHO there is still a bit of space for trying to make the current
> implementation better, before we give up and go to add a tunable :-).

Agreed. As a tunable apologist, my attempt to offer background on one way
this is solved today ended up looking more like a request :) .