Re: [PATCH] sched: Avoid side-effect of tickless idle on update_cpu_load

From: Peter Zijlstra
Date: Wed May 12 2010 - 06:55:10 EST


On Fri, 2010-05-07 at 18:48 -0700, Venkatesh Pallipadi wrote:
> tickless idle has a negative side effect on update_cpu_load(),
> which in turn can affect load balancing behavior.
>
> update_cpu_load() is supposed to be called every tick, to keep track of
> various load indices. With tickless idle, there are no scheduler ticks called
> on the idle CPUs. Idle CPUs may still do load balancing (with idle_load_balance
> CPU) using the stale cpu_load. It will also cause problems when all CPUs go
> idle for a while and become active again. In this case loads would not degrade
> as expected.
>
> This is how the rq->nr_load_updates change looks under different conditions:

<snip>

> That is, update_cpu_load() works properly only when all CPUs are busy.
> If all are idle, all the CPUs get far fewer updates.
> And when a few CPUs are busy and the rest are idle, only the busy CPUs and
> the ilb CPU do proper updates, and the rest of the idle CPUs get fewer updates.
>
> The patch keeps track of when the last update was done and fixes up
> the load avg based on the current time.
>
> On one of my test systems, SPECjbb with warehouse 1..numcpus, the patch improves
> throughput numbers by ~1% (average of 6 runs).
> On another test system (with a different domain hierarchy) there is no
> noticeable change in perf.

Ah, I had wondered about this aspect of nohz at one time. Nice you've
investigated and measured the performance impact.

I largely agree with the solution, but have some comments below.

> Signed-off-by: Venkatesh Pallipadi <venki@xxxxxxxxxx>
> ---
> kernel/sched.c | 82 +++++++++++++++++++++++++++++++++++++++++++++++---
> kernel/sched_fair.c | 5 ++-
> 2 files changed, 81 insertions(+), 6 deletions(-)
>
> diff --git a/kernel/sched.c b/kernel/sched.c
> index 3c2a54f..0abd7db 100644
> --- a/kernel/sched.c
> +++ b/kernel/sched.c
> @@ -502,6 +502,7 @@ struct rq {
> unsigned long nr_running;
> #define CPU_LOAD_IDX_MAX 5
> unsigned long cpu_load[CPU_LOAD_IDX_MAX];
> + unsigned long last_load_update_tick;
> #ifdef CONFIG_NO_HZ
> unsigned char in_nohz_recently;
> #endif
> @@ -1816,6 +1817,7 @@ static void cfs_rq_set_shares(struct cfs_rq *cfs_rq, unsigned long shares)
> static void calc_load_account_active(struct rq *this_rq);
> static void update_sysctl(void);
> static int get_update_sysctl_factor(void);
> +static void update_cpu_load(struct rq *this_rq);
>
> static inline void __set_task_cpu(struct task_struct *p, unsigned int cpu)
> {
> @@ -3088,23 +3090,84 @@ static void calc_load_account_active(struct rq *this_rq)
> }
>
> /*
> + * Load degrade calculations below are approximated on a 128 point scale.
> + * degrade_zero_ticks is the number of ticks after which old_load at any
> + * particular idx is approximated to be zero.
> + * degrade_factor is a precomputed table, a row for each load idx.
> + * Each column corresponds to degradation factor for a power of two ticks,
> + * based on 128 point scale.
> + * Example:
> + * row 2, col 3 (=12) says that the degradation at load idx 2 after
> + * 8 ticks is 12/128 (which is an approximation of 3^8/4^8).
> + */

This comment utterly forgets to explain why. Does the degradation factor
correspond with the decay otherwise used? Maybe explicitly mention that
function and clarify the whole cpu_load math.
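
For reference, the table entries can be reproduced from the per-tick decay
that the comment's own example implies: at load index idx, one idle tick
multiplies the old load by (2^idx - 1)/2^idx, and column col covers 2^col
ticks. A minimal userspace sketch, assuming that reading and assuming the
values are truncated rather than rounded (expected_factor() is a made-up
name, not kernel code):

```c
#include <assert.h>

#define CPU_LOAD_IDX_MAX 5
#define DEGRADE_SHIFT 7

/*
 * Hypothetical helper: recompute one degrade_factor[idx][col] entry.
 * Assumes the per-tick decay at index idx is (2^idx - 1) / 2^idx and
 * that column col stands for 2^col missed ticks.
 */
static int expected_factor(int idx, int col)
{
	double scale, f;
	int k;

	assert(idx >= 0 && idx < CPU_LOAD_IDX_MAX);
	assert(col >= 0 && col <= DEGRADE_SHIFT);

	scale = (double)(1 << idx);
	f = (scale - 1.0) / scale;	/* decay over a single tick */
	for (k = 0; k < col; k++)
		f *= f;			/* each squaring doubles the tick count */

	/* Express on the 128 point scale, truncating toward zero. */
	return (int)(f * (1 << DEGRADE_SHIFT));
}
```

With these assumptions expected_factor(2, 3) gives 12, which matches the
"row 2, col 3" example in the comment.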

> +#define DEGRADE_SHIFT 7
> +static const unsigned char
> + degrade_zero_ticks[CPU_LOAD_IDX_MAX] = {0, 8, 32, 64, 128};
> +static const unsigned char
> + degrade_factor[CPU_LOAD_IDX_MAX][DEGRADE_SHIFT + 1] = {
> + {0, 0, 0, 0, 0, 0, 0, 0},
> + {64, 32, 8, 0, 0, 0, 0, 0},
> + {96, 72, 40, 12, 1, 0, 0},
> + {112, 98, 75, 43, 15, 1, 0},
> + {120, 112, 98, 76, 45, 16, 2} };
> +
> +/*
> + * Update cpu_load for any backlog'd ticks. The backlog would be when
> + * CPU is idle and so we just decay the old load without adding any new load.
> + */
> +static unsigned long update_backlog(unsigned long load,
> + unsigned long missed_updates, int idx)
> +{
> + int j = 0;
> +
> + if (missed_updates >= degrade_zero_ticks[idx])
> + return 0;
> +
> + if (idx == 1)
> + return load >> missed_updates;
> +
> + while (missed_updates) {
> + if (missed_updates % 2)
> + load =(load * degrade_factor[idx][j]) >> DEGRADE_SHIFT;
> +
> + missed_updates >>= 1;
> + j++;
> + }
> + return load;
> +}
> +
> +/*
> * Update rq->cpu_load[] statistics. This function is usually called every
> - * scheduler tick (TICK_NSEC).
> + * scheduler tick (TICK_NSEC). With tickless idle this will not be called
> + * every tick. We fix it up based on jiffies.
> */
> static void update_cpu_load(struct rq *this_rq)
> {
> unsigned long this_load = this_rq->load.weight;
> + unsigned long curr_jiffies = jiffies;
> + unsigned long pending_updates, missed_updates;
> int i, scale;
>
> this_rq->nr_load_updates++;
>
> + if (curr_jiffies == this_rq->last_load_update_tick)
> + return;

Under which conditions can this happen? Going idle right after having
had the tick?
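
Whatever the trigger turns out to be, the check also protects the
arithmetic that follows it: both counters are unsigned long, so with no
elapsed jiffies the "pending_updates - 1" would wrap instead of going
negative. A small userspace illustration (missed_ticks() is a hypothetical
name, it only mirrors the subtraction without the early return):

```c
/*
 * Hypothetical userspace helper: the same subtraction as in the patch,
 * but without the early return for now == last.
 */
static unsigned long missed_ticks(unsigned long now, unsigned long last)
{
	unsigned long pending = now - last;	/* 0 if called twice in one jiffy */

	return pending - 1;	/* wraps to ULONG_MAX when pending == 0 */
}
```

A wrapped value like that is >= degrade_zero_ticks[idx] for every idx, so
update_backlog() would zero the old load.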

> + pending_updates = curr_jiffies - this_rq->last_load_update_tick;
> + this_rq->last_load_update_tick = curr_jiffies;
> + missed_updates = pending_updates - 1;
> +
> /* Update our load: */
> - for (i = 0, scale = 1; i < CPU_LOAD_IDX_MAX; i++, scale += scale) {
> + this_rq->cpu_load[0] = this_load; /* Fasttrack for idx 0 */

Why is this special case worth it?

> + for (i = 1, scale = 2; i < CPU_LOAD_IDX_MAX; i++, scale += scale) {
> unsigned long old_load, new_load;
>
> /* scale is effectively 1 << i now, and >> i divides by scale */
>
> old_load = this_rq->cpu_load[i];
> + if (missed_updates)
> + old_load = update_backlog(old_load, missed_updates, i);

Would it make sense to stuff that conditional in update_backlog() and
have a clearer flow? Maybe rename update_backlog() to decay_load() or
such?
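
Something along these lines is what I have in mind. This is an untested
userspace sketch that copies the tables from the patch; decay_load() is
only the name floated above, and the rest follows update_backlog() as
posted:

```c
#define CPU_LOAD_IDX_MAX 5
#define DEGRADE_SHIFT 7

/* Tables copied from the patch under review. */
static const unsigned char
	degrade_zero_ticks[CPU_LOAD_IDX_MAX] = {0, 8, 32, 64, 128};
static const unsigned char
	degrade_factor[CPU_LOAD_IDX_MAX][DEGRADE_SHIFT + 1] = {
		{0, 0, 0, 0, 0, 0, 0, 0},
		{64, 32, 8, 0, 0, 0, 0, 0},
		{96, 72, 40, 12, 1, 0, 0},
		{112, 98, 75, 43, 15, 1, 0},
		{120, 112, 98, 76, 45, 16, 2} };

/*
 * Sketch of the suggested shape: the "nothing missed" case is handled
 * here, so the caller can call this unconditionally.
 */
static unsigned long decay_load(unsigned long load,
				unsigned long missed_updates, int idx)
{
	int j = 0;

	if (!missed_updates)
		return load;		/* no backlog, nothing to decay */

	if (missed_updates >= degrade_zero_ticks[idx])
		return 0;		/* old load has fully decayed */

	if (idx == 1)
		return load >> missed_updates;	/* halves on every tick */

	/* Apply one table factor per set bit of missed_updates. */
	while (missed_updates) {
		if (missed_updates % 2)
			load = (load * degrade_factor[idx][j]) >> DEGRADE_SHIFT;

		missed_updates >>= 1;
		j++;
	}
	return load;
}
```

The loop body in update_cpu_load() would then read
old_load = decay_load(this_rq->cpu_load[i], missed_updates, i);
without the extra branch.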


~ Peter

