Re: [RFC/RFT][PATCH v8] cpuidle: New timer events oriented governor for tickless systems

From: Rafael J. Wysocki
Date: Tue Oct 08 2019 - 05:51:45 EST


On Tue, Oct 8, 2019 at 8:20 AM Doug Smythies <dsmythies@xxxxxxxxx> wrote:
>
> On 2019.10.06 08:34 Rafael J. Wysocki wrote:
> > On Sun, Oct 6, 2019 at 4:46 PM Doug Smythies <dsmythies@xxxxxxxxx> wrote:
> >> On 2019.10.01 02:32 Rafael J. Wysocki wrote:
> >>> On Sun, Sep 29, 2019 at 6:05 PM Doug Smythies <dsmythies@xxxxxxxxx> wrote:
> >>>> On 2019.09.26 09:32 Doug Smythies wrote:
> >>>>
> >>>>> If the deepest idle state is disabled, the system
> >>>>> can become somewhat unstable, with anywhere between no problem
> >>>>> at all, to the occasional temporary jump using a lot more
> >>>>> power for a few seconds, to a permanent jump using a lot more
> >>>>> power continuously. I have been unable to isolate the exact
> >>>>> test load conditions under which this will occur. However,
> >>>>> temporarily disabling and then enabling other idle states
> >>>>> seems to make for a somewhat repeatable test. It is important
> >>>>> to note that the issue occurs with only ever disabling the deepest
> >>>>> idle state, just not reliably.
> >>>>>
> >>>>> I want to know how you want to proceed before I do a bunch of
> >>>>> regression testing.
> >>>>
> >> I do not think I stated it clearly before: The problem here is that some CPUs
> >> seem to get stuck in idle state 0, and when they do power consumption spikes,
> >> often by several hundred % and often indefinitely.
> >
> > That indeed has not been clear to me, thanks for the clarification!
>
> >
> >> I made a hack job automated test:
> >> Kernel tests fail rate
> >> 5.4-rc1 6616 13.45%
> >> 5.3 2376 4.50%
> >> 5.3-teov7 12136 0.00% <<< teo.c reverted and teov7 put in its place.
> >> 5.4-rc1-ds 11168 0.00% <<< [old] proposed patch (> 7 hours test time)
>
>
> 5.4-rc1-ds12 4224 0.005 <<< new proposed patch
>
> >>
> >> [old] Proposed patch (on top of kernel 5.4-rc1): [deleted]
>
> > This change may cause the deepest state to be selected even if its
> > "hits" metric is less than the "misses" one AFAICS, in which case the
> > max_early_index state should be selected instead.
> >
> > It looks like the max_early_index computation is broken when the
> > deepest state is disabled.
>
> O.K. Thanks for your quick reply, and insight.
>
> I think long durations always need to be counted, but currently if
> the deepest idle state is disabled, they are not.
> How about this?:
> (test results added above, more tests pending if this might be a path forward.)
>
> diff --git a/drivers/cpuidle/governors/teo.c b/drivers/cpuidle/governors/teo.c
> index b5a0e49..a970d2c 100644
> --- a/drivers/cpuidle/governors/teo.c
> +++ b/drivers/cpuidle/governors/teo.c
> @@ -155,10 +155,12 @@ static void teo_update(struct cpuidle_driver *drv, struct cpuidle_device *dev)
>
> cpu_data->states[i].early_hits -= early_hits >> DECAY_SHIFT;
>
> - if (drv->states[i].target_residency <= sleep_length_us) {
> - idx_timer = i;
> - if (drv->states[i].target_residency <= measured_us)
> - idx_hit = i;
> + if (!(drv->states[i].disabled || dev->states_usage[i].disable)){
> + if (drv->states[i].target_residency <= sleep_length_us) {
> + idx_timer = i;
> + if (drv->states[i].target_residency <= measured_us)
> + idx_hit = i;
> + }

What if the state is enabled again after some time?

> }
> }
>
> @@ -256,39 +258,25 @@ static int teo_select(struct cpuidle_driver *drv, struct cpuidle_device *dev,
> struct cpuidle_state *s = &drv->states[i];
> struct cpuidle_state_usage *su = &dev->states_usage[i];
>
> - if (s->disabled || su->disable) {
> - /*
> - * If the "early hits" metric of a disabled state is
> - * greater than the current maximum, it should be taken
> - * into account, because it would be a mistake to select
> - * a deeper state with lower "early hits" metric. The
> - * index cannot be changed to point to it, however, so
> - * just increase the max count alone and let the index
> - * still point to a shallower idle state.
> - */
> - if (max_early_idx >= 0 &&
> - count < cpu_data->states[i].early_hits)
> - count = cpu_data->states[i].early_hits;
> -
> - continue;

AFAICS, adding early_hits to count is not a mistake if there are still
enabled states deeper than the current one.

Besides, can you just leave the "continue" here instead of changing
the indentation level for everything below?

> - }
>
> - if (idx < 0)
> - idx = i; /* first enabled state */
> + if (!(s->disabled || su->disable)) {
> + if (idx < 0)
> + idx = i; /* first enabled state */
>
> - if (s->target_residency > duration_us)
> - break;
> + if (s->target_residency > duration_us)
> + break;
>
> - if (s->exit_latency > latency_req && constraint_idx > i)
> - constraint_idx = i;
> + if (s->exit_latency > latency_req && constraint_idx > i)
> + constraint_idx = i;
>
> - idx = i;
> + idx = i;
>
> - if (count < cpu_data->states[i].early_hits &&
> - !(tick_nohz_tick_stopped() &&
> - drv->states[i].target_residency < TICK_USEC)) {
> - count = cpu_data->states[i].early_hits;
> - max_early_idx = i;
> + if (count < cpu_data->states[i].early_hits &&
> + !(tick_nohz_tick_stopped() &&
> + drv->states[i].target_residency < TICK_USEC)) {
> + count = cpu_data->states[i].early_hits;
> + max_early_idx = i;
> + }
> }
> }