Re: [RFC 0/1] Weighted approach to gather and use history in TEO governor

From: Gautham R Shenoy
Date: Tue Feb 25 2020 - 00:13:25 EST


Hello Pratik,

On Sat, Feb 22, 2020 at 12:30:01PM +0530, Pratik Rajesh Sampat wrote:
> Currently the TEO governor apart from the TEO timer and hit/miss/early
> hit buckets; also gathers history of 8 intervals and if there are
> significant idle durations less than the current, then it decides if a
> shallower state must be chosen.
>
> The current sliding history window does do a fair job at prediction,
> however, the hard-coded window can be a limiting factor for an accurate
> prediction and having the window size increase can also linearly affect
> both space and time complexity of the prediction.
>
> To complement the current moving window history, an approach is devised
> where each idle state separately maintains a weight for itself and its
> counterpart idle states to form a probability distribution.
>
> When a decision needs to be made, the TEO governor selects an idle state
> based on its timer and other hits/early hits metric. After which, the
> probability distribution of that selected idle state is looked at which
> gives insight into how probable that state is to occur if picked.
>
> The probability distribution is nothing but a n*n matrix, where
> n = drv->state_count.
> Each entry in the array signifies a weight for that row.
> The weights can vary from the range [0-10000].
>
> For example:
> state_mat[1][2] = 3000 means that previously when state 1 was selected,
> the probability that state 2 will occur is 30%.

Could you clarify what this means ? Do you mean that when state 1 is
selected, the probability that the CPU will be in state 1 for the
duration corresponding to state 2's residency is 30% ?

Further more, this means that during idle state selection we have O(n)
complexity if n is the number of idle states, since we want to select
a state where we are more likely to reside ?

> The trailing zeros correspond to having more resolution while increasing
> or reducing the weights for correction.
>
> Currently, for selection of an idle state based on probabilities, a
> weighted random number generator is used to choose one of the idle
> states. Naturally, the states with higher weights are more likely to be
> chosen.
>
> On wakeup, the weights are updated. The state with which it should have
> woken up with (could be the hit / miss / early hit state) is increased
> in weight by the "LEARNING_RATE" % and the rest of the states for that
> index are reduced by the same factor.

So we only update the weight in just one cell ?

To use the example above, if we selected state 1, and we resided in it
for a duration corresponding to state 2's residency, we will only
update state_mat[1][2] ?

>
> The advantage of this approach is that unlimited history of idle states
> can be maintained in constant overhead, which can help in more accurate
> prediction for choosing idle states.
>
> The advantage of unlimited history can become a possible disadvantage as
> the lifetime history for that thread may make the weights stale and
> influence the choosing of idle states which may not be relevant
> anymore.

Can the effect of this staleless be observed ? For instance, if we
have a particular idle entry/exit pattern for a very long duration,
say a few 10s of minutes and then the idle entry/exit pattern changes,
how bad will the weighted approach be compared to the current TEO
governor ?




> Aging the weights could be a solution for that, although this RFC does
> not cover the implementation for that.
>
> Having a finer view of the history in addition to weighted randomized
> salt seems to show some promise in terms of saving power without
> compromising performance.
>
> Benchmarks:
> Note: Wt. TEO governor represents the governor after the proposed change
>
> Schbench
> ========
> Benchmarks wakeup latencies
> Scale of measurement:
> 1. 99th percentile latency - usec
> 2. Power - Watts
>
> Command: $ schbench -c 30000 -s 30000 -m 6 -r 30 -t <Threads>
> Varying parameter: -t
>
> Machine: IBM POWER 9
>
> +--------+-------------+-----------------+-----------+-----------------+
> | Threads| TEO latency | Wt. TEO latency | TEO power | Wt. TEO power |
> +--------+-------------+-----------------+-----------+-----------------+
> | 2 | 979 | 949 ( +3.06%) | 38 | 36 ( +5.26%) |
> | 4 | 997 | 1042 ( -4.51%) | 51 | 39 ( +23.52%) |
> | 8 | 1158 | 1050 ( +9.32%) | 89 | 63 ( +29.21%) |
> | 16 | 1138 | 1135 ( +0.26%) | 105 | 117 ( -11.42%) |
> +--------+-------------+-----------------+-----------+-----------------+
>
> Sleeping Ebizzy
> ===============
> Program to generate workloads resembling web server workloads.
> The benchmark is customized to allow for a sleep interval -i
> Scale of measurement:
> 1. Number of records/s
> 2. systime (s)
>
> Parameters:
> 1. -m => Always use mmap instead of malloc
> 2. -M => Never use mmap
> 3. -S <seconds> => Number of seconds to run
> 4. -i <interval> => Sleep interval
>
> Machine: IBM POWER 9
>
> +-------------------+-------------+-------------------+-----------+---------------+
> | Parameters | TEO records | Wt. TEO records | TEO power | Wt. TEO power |
> +-------------------+-------------+-------------------+-----------+---------------+
> | -S 60 -i 10000 | 1115000 | 1198081 ( +7.45%) | 149 | 150 ( -0.66%) |
> | -m -S 60 -i 10000 | 15879 | 15513 ( -2.30%) | 23 | 22 ( +4.34%) |
> | -M -S 60 -i 10000 | 72887 | 77546 ( +6.39%) | 104 | 103 ( +0.96%) |
> +-------------------+-------------+-------------------+-----------+---------------+
>
> Hackbench
> =========
> Creates a specified number of pairs of schedulable entities
> which communicate via either sockets or pipes and time how long it
> takes for each pair to send data back and forth.
> Scale of measurement:
> 1. Time (s)
> 2. Power (watts)
>
> Command: Sockets: $ hackbench -l <Messages>
> Pipes : $ hackbench --pipe -l <Messages>
> Varying parameter: -l
>
> Machine: IBM POWER 9
>
> +----------+------------+-------------------+----------+-------------------+
> | Messages | TEO socket | Wt. TEO socket | TEO pipe | Wt. TEO pipe |
> +----------+------------+-------------------+----------+-------------------+
> | 100 | 0.042 | 0.043 ( -2.32%) | 0.031 | 0.032 ( +3.12%) |
> | 1000 | 0.258 | 0.272 ( +5.14%) | 0.301 | 0.312 ( -3.65%) |
> | 10000 | 2.397 | 2.441 ( +1.80%) | 5.642 | 5.092 ( +9.74%) |
> | 100000 | 23.691 | 23.730 ( -0.16%) | 57.762 | 57.857 ( -0.16%) |
> | 1000000 | 234.103 | 233.841 ( +0.11%) | 559.807 | 592.304 ( -5.80%) |
> +----------+------------+-------------------+----------+-------------------+
>
> Power :Socket: Consistent between 135-140 watts for both TEO and Wt. TEO
> Pipe: Consistent between 125-130 watts for both TEO and Wt. TEO
>
>


Could you also provide power measurements for the duration when the
system is completely idle for each of the variants of TEO governor ?
Is it the case that the benefits that we are seeing above are only due
to Wt. TEO being more conservative than TEO governor by always
choosing a shallower state ?





>
> Pratik Rajesh Sampat (1):
> Weighted approach to gather and use history in TEO governor
>
> drivers/cpuidle/governors/teo.c | 95 +++++++++++++++++++++++++++++++--
> 1 file changed, 90 insertions(+), 5 deletions(-)
>
> --
> 2.17.1
>

--
Thanks and Regards
gautham.