Re: [PATCH 5/6] sched/fair: Get rid of scaling utilization by capacity_orig

From: bsegall
Date: Mon Sep 21 2015 - 13:30:15 EST


Yuyang Du <yuyang.du@xxxxxxxxx> writes:

> On Thu, Sep 17, 2015 at 12:38:25PM +0200, Peter Zijlstra wrote:
>> On Fri, Sep 11, 2015 at 06:22:47PM +0100, Morten Rasmussen wrote:
>>
>> > While at it, should I include Yuyang's patch redefining the SCALE/SHIFT
>> > mess?
>>
>> I suspect his patch will fail to compile on ARM which uses
>> SCHED_CAPACITY_* outside of kernel/sched/*.
>>
>> But if you all (Ben, Yuyang, you) can agree on a patch simplifying these
>> things I'm not opposed to it.
>
> Yes, indeed. So SCHED_RESOLUTION_SHIFT has to be defined in include/linux/sched.h.
>
> With this, I think the code still needs some cleanup and, importantly,
> documentation.
>
> But first, I think load_sum and load_avg can accommodate NICE_0_LOAD at either
> high or low resolution. So we have no reason to keep load_avg at low resolution
> (10 bits) when NICE_0_LOAD has high resolution (20 bits), because load_avg =
> runnable% * load, as opposed to the current load_avg = runnable% * scale_load_down(load).
>
> Can we get rid of all scale_load_down() for the runnable load average?

Hmm, LOAD_AVG_MAX * prio_to_weight[0] is 4237627662, i.e. barely within a
32-bit unsigned long, but LOAD_AVG_MAX * MAX_SHARES is already going to
overflow on 32-bit (even with the old code). This should probably be
fixed... somehow. Dividing load_sum by 4 on 32-bit would work, though it
would be ugly. Reducing MAX_SHARES by 2 bits on 32-bit might have made
sense, but it would be a weird difference between 32-bit and 64-bit, and
could break userspace anyway, so it's presumably too late for that.

64-bit has ~30 bits free, so getting rid of scale_load_down() would be fine
there, so long as SCHED_LOAD_RESOLUTION (SLR) stays 0 on 32-bit.
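
For reference, the headroom arithmetic as a standalone sketch (assuming the
values in the tree at the time: LOAD_AVG_MAX = 47742, prio_to_weight[0] =
88761, MAX_SHARES = 1 << 18; this is illustrative userspace code, not kernel
code):

#include <stdio.h>

int main(void)
{
	/* Values as of this thread. */
	unsigned long long load_avg_max = 47742;	/* max decayed sum */
	unsigned long long nice_m20_weight = 88761;	/* prio_to_weight[0] */
	unsigned long long max_shares = 1ULL << 18;	/* MAX_SHARES */

	/* 4237627662: barely fits in a 32-bit unsigned long (2^32 ~ 4.29e9) */
	printf("%llu\n", load_avg_max * nice_m20_weight);

	/* ~1.25e10: overflows a 32-bit unsigned long by roughly 3x */
	printf("%llu\n", load_avg_max * max_shares);

	return 0;
}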

>
> --
>
> Subject: [PATCH] sched/fair: Generalize the load/util averages resolution
> definition
>
> A metric needs a certain resolution to determine how much detail we
> can capture (i.e. how much detail is lost to integer rounding), which
> also determines the range of the metric.
>
> For instance, to increase the resolution of [0, 1] (two levels), one
> can multiply by 1024 and get [0, 1024] (1025 levels).
>
> In sched/fair, a few metrics depend on this resolution: load/load_avg,
> util_avg, and capacity (frequency adjustment). In order to reduce the
> risk of mistakes relating to resolution/range, we therefore generalize
> the resolution by defining a basic resolution constant and then express
> all metrics in terms of that basic resolution. The basic resolution is
> 1024, or (1 << 10). Further, the basic resolution can be applied
> repeatedly to increase the final resolution.
>
> As pointed out by Ben Segall, NICE_0's weight (visible to the user) and
> its load have independent resolutions, but they must be well calibrated.
>
> Signed-off-by: Yuyang Du <yuyang.du@xxxxxxxxx>
> ---
> include/linux/sched.h | 9 ++++++---
> kernel/sched/fair.c | 4 ----
> kernel/sched/sched.h | 15 ++++++++++-----
> 3 files changed, 16 insertions(+), 12 deletions(-)
>
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index bd38b3e..9b86f79 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -909,10 +909,13 @@ enum cpu_idle_type {
> CPU_MAX_IDLE_TYPES
> };
>
> +# define SCHED_RESOLUTION_SHIFT 10
> +# define SCHED_RESOLUTION_SCALE (1L << SCHED_RESOLUTION_SHIFT)
> +
> /*
> * Increase resolution of cpu_capacity calculations
> */
> -#define SCHED_CAPACITY_SHIFT 10
> +#define SCHED_CAPACITY_SHIFT SCHED_RESOLUTION_SHIFT
> #define SCHED_CAPACITY_SCALE (1L << SCHED_CAPACITY_SHIFT)
>
> /*
> @@ -1180,8 +1183,8 @@ struct load_weight {
> * 1) load_avg factors frequency scaling into the amount of time that a
> * sched_entity is runnable on a rq into its weight. For cfs_rq, it is the
> * aggregated such weights of all runnable and blocked sched_entities.
> - * 2) util_avg factors frequency and cpu scaling into the amount of time
> - * that a sched_entity is running on a CPU, in the range [0..SCHED_LOAD_SCALE].
> + * 2) util_avg factors frequency and cpu capacity scaling into the amount of time
> + * that a sched_entity is running on a CPU, in the range [0..SCHED_CAPACITY_SCALE].
> * For cfs_rq, it is the aggregated such times of all runnable and
> * blocked sched_entities.
> * The 64 bit load_sum can:
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 4df37a4..c61fd8e 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -2522,10 +2522,6 @@ static u32 __compute_runnable_contrib(u64 n)
> return contrib + runnable_avg_yN_sum[n];
> }
>
> -#if (SCHED_LOAD_SHIFT - SCHED_LOAD_RESOLUTION) != 10 || SCHED_CAPACITY_SHIFT != 10
> -#error "load tracking assumes 2^10 as unit"
> -#endif
> -
> #define cap_scale(v, s) ((v)*(s) >> SCHED_CAPACITY_SHIFT)
>
> /*
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 3845a71..31b4022 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -53,18 +53,23 @@ static inline void update_cpu_load_active(struct rq *this_rq) { }
> * increased costs.
> */
> #if 0 /* BITS_PER_LONG > 32 -- currently broken: it increases power usage under light load */
> -# define SCHED_LOAD_RESOLUTION 10
> -# define scale_load(w) ((w) << SCHED_LOAD_RESOLUTION)
> -# define scale_load_down(w) ((w) >> SCHED_LOAD_RESOLUTION)
> +# define SCHED_LOAD_SHIFT (SCHED_RESOLUTION_SHIFT + SCHED_RESOLUTION_SHIFT)
> +# define scale_load(w) ((w) << SCHED_RESOLUTION_SHIFT)
> +# define scale_load_down(w) ((w) >> SCHED_RESOLUTION_SHIFT)
> #else
> -# define SCHED_LOAD_RESOLUTION 0
> +# define SCHED_LOAD_SHIFT (SCHED_RESOLUTION_SHIFT)
> # define scale_load(w) (w)
> # define scale_load_down(w) (w)
> #endif
>
> -#define SCHED_LOAD_SHIFT (10 + SCHED_LOAD_RESOLUTION)
> #define SCHED_LOAD_SCALE (1L << SCHED_LOAD_SHIFT)
>
> +/*
> + * NICE_0's weight (visible to user) and its load (invisible to user) have
> + * independent resolution, but they should be well calibrated. We use scale_load()
> + * and scale_load_down(w) to convert between them; the following must be true:
> + * scale_load(prio_to_weight[20]) == NICE_0_LOAD
> + */
> #define NICE_0_LOAD SCHED_LOAD_SCALE
> #define NICE_0_SHIFT SCHED_LOAD_SHIFT
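
For concreteness, the calibration invariant in that comment can be checked
in isolation (a standalone sketch assuming the high-resolution branch, i.e.
SCHED_LOAD_SHIFT == 20; prio_to_weight[20] is the nice-0 weight, 1024):

/* Quick check of scale_load(prio_to_weight[20]) == NICE_0_LOAD, using the
 * definitions from the patch above reproduced locally for illustration.
 */
#include <assert.h>

#define SCHED_RESOLUTION_SHIFT	10
#define SCHED_LOAD_SHIFT	(SCHED_RESOLUTION_SHIFT + SCHED_RESOLUTION_SHIFT)
#define NICE_0_LOAD		(1L << SCHED_LOAD_SHIFT)
#define scale_load(w)		((w) << SCHED_RESOLUTION_SHIFT)

int main(void)
{
	long nice_0_weight = 1024;	/* prio_to_weight[20] */

	/* 1024 << 10 == 1 << 20 == NICE_0_LOAD */
	assert(scale_load(nice_0_weight) == NICE_0_LOAD);
	return 0;
}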

I still think tying the scale_load() shift to the SCHED_CAPACITY/etc. shift
is silly, and tying NICE_0_LOAD/SHIFT in as well is worse. Honestly, if I
were going to change anything, it would be to define NICE_0_LOAD/SHIFT
entirely separately from SCHED_LOAD_SCALE/SHIFT.
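
A minimal sketch of what defining them separately might look like
(NICE_0_LOAD_SHIFT is a made-up name here, the 64-bit/32-bit split just
mirrors the existing scale_load() conditional, and kernel context is
assumed for BITS_PER_LONG and SCHED_RESOLUTION_SHIFT; illustrative only,
not a patch):

/* Keep NICE_0_LOAD tied to the internal load resolution, independent of
 * SCHED_LOAD_SCALE/SHIFT and the SCHED_CAPACITY_* constants.
 */
#if BITS_PER_LONG > 32
# define NICE_0_LOAD_SHIFT	(SCHED_RESOLUTION_SHIFT + SCHED_RESOLUTION_SHIFT)
#else
# define NICE_0_LOAD_SHIFT	(SCHED_RESOLUTION_SHIFT)
#endif
#define NICE_0_LOAD		(1L << NICE_0_LOAD_SHIFT)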

However, I'm not sure whether calculate_imbalance's use of SCHED_LOAD_SCALE
is actually a separate use of 1024*SLR-as-a-percentage, or is basically
assuming most tasks are nice-0, or something else. It sure /looks/ like it's
comparing values with different units: it computes (nr_running * CONST -
group_capacity) and compares that against load, so it looks like both (i.e.
increasing the load.weight of everything on your system by X% would change
the load balancer's behavior here).
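
To make the unit mismatch concrete, here is a tiny standalone sketch that
only mirrors the shape of that expression (names and numbers are made up;
this is not the kernel code):

#include <stdio.h>

#define SCHED_LOAD_SCALE	(1UL << 10)

int main(void)
{
	/* Hypothetical group: 3 runnable tasks on 2 CPUs worth of capacity. */
	unsigned long sum_nr_running = 3;
	unsigned long group_capacity = 2 * 1024;	/* capacity units */
	unsigned long group_load = 5 * 1024;		/* weight-scaled units */

	/* The "(nr_running * CONST - group_capacity)" shape: a task count
	 * scaled by SCHED_LOAD_SCALE minus a capacity value...
	 */
	unsigned long load_above_capacity =
		sum_nr_running * SCHED_LOAD_SCALE - group_capacity;

	/* ...compared against load, a weight-scaled quantity; bumping every
	 * task's load.weight by X% changes group_load but not nr_running or
	 * group_capacity, so it shifts this comparison.
	 */
	printf("%d\n", group_load > load_above_capacity);
	return 0;
}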

Given that, it might make sense to make it explicit that capacity units and
nice-0-task units have to be the same thing due to load-balancer
approximations (though they are still entirely separate from the
SCHED_LOAD_RESOLUTION multiplier).