Re: [PATCH 5/6] sched/fair: Get rid of scaling utilization by capacity_orig

From: Yuyang Du
Date: Wed Sep 23 2015 - 03:11:38 EST


On Tue, Sep 22, 2015 at 10:18:30AM -0700, bsegall@xxxxxxxxxx wrote:
> Yuyang Du <yuyang.du@xxxxxxxxx> writes:
>
> > On Mon, Sep 21, 2015 at 10:30:04AM -0700, bsegall@xxxxxxxxxx wrote:
> >> > But first, I think load_sum and load_avg can afford NICE_0_LOAD with either high
> >> > or low resolution, so we have no reason to have a low-resolution (10-bit) load_avg
> >> > when NICE_0_LOAD has high resolution (20 bits), because load_avg = runnable% * load,
> >> > as opposed to the current load_avg = runnable% * scale_load_down(load).
> >> >
> >> > Shall we get rid of all scale_load_down() for runnable load average?
> >>
> >> Hmm, LOAD_AVG_MAX * prio_to_weight[0] is 4237627662, ie barely within a
> >> 32-bit unsigned long, but in fact LOAD_AVG_MAX * MAX_SHARES is already
> >> going to give errors on 32-bit (even with the old code in fact). This
> >> should probably be fixed... somehow (dividing by 4 for load_sum on
> >> 32-bit would work, though be ugly. Reducing MAX_SHARES by 2 bits on
> >> 32-bit might have made sense but would be a weird difference between 32
> >> and 64, and could break userspace anyway, so it's presumably too late
> >> for that).
> >>
> >> 64-bit has ~30 bits free, so this would be fine so long as SLR is 0 on
> >> 32-bit.
> >>
> >
> > load_avg has no LOAD_AVG_MAX term in it; it is runnable% * load, IOW load_avg <= load.
> > So, on 32-bit, a cfs_rq's load_avg can host 2^32/prio_to_weight[0]/1024 = 47 top-weight
> > tasks with 20-bit load resolution. This is ok, because struct load_weight's load is
> > also unsigned long: if it overflows, cfs_rq->load.weight will have overflowed first.
> >
> > However, on second thought, this is not quite right, because load_avg is not
> > necessarily no greater than load: load_avg has blocked load in it. Although
> > load_avg is still at the same level as load (converging to be <= load), we may
> > not want to risk overflow on 32-bit.

That second thought was a mistake (I don't know what I was thinking): load_avg is
for sure no greater than load, with or without blocked load, since load_avg =
runnable% * load and runnable% can never exceed 100%.

With that said, it really does not matter what the numbers below are, on a 32-bit
or a 64-bit machine. What matters is that cfs_rq->load.weight is the one that needs
to worry about overflow, not load_avg. It is as simple as that.
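
To make the overflow arithmetic concrete, here is a quick userspace
back-of-envelope check (a standalone sketch, not kernel code; 88761 is
prio_to_weight[0], and the << 10 assumes the extra 10 bits of load
resolution):

#include <stdio.h>

int main(void)
{
	/* nice -20 weight, scaled up by the extra 10 resolution bits */
	unsigned long long top_weight = 88761ULL << 10;

	/* top-weight tasks a 32-bit load.weight can hold before overflow */
	printf("%llu\n", (1ULL << 32) / top_weight);	/* prints 47 */

	return 0;
}

And since load_avg = runnable% * load, load_avg can overflow no earlier
than load.weight itself.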

With that, I think we can and should get rid of scale_load_down() for load_avg.
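
Concretely, that would mean dropping scale_load_down() at the
__update_load_avg() call sites, along these lines (an untested sketch
against the code in this series, shown for illustration only):

-	__update_load_avg(now, cpu, &se->avg,
-			  se->on_rq * scale_load_down(se->load.weight),
-			  cfs_rq->curr == se, NULL);
+	__update_load_avg(now, cpu, &se->avg,
+			  se->on_rq * se->load.weight,
+			  cfs_rq->curr == se, NULL);

load_sum is already u64, so it has the headroom for the extra 10 bits
of weight.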

> Yeah, I missed that load_sum was u64 and only load_avg was long. This
> means we're fine on 32-bit with no SLR (or more precisely, cfs_rq
> runnable_load_avg can overflow, but only when cfs_rq load.weight does,
> so whatever). On 64-bit you can currently have 2^36 cgroups or 2^37
> tasks before load.weight overflows, and ~2^31 tasks before
> runnable_load_avg does, which is obviously fine (and in fact you'd hit
> PID_MAX_LIMIT first, even if you had the cpu/ram/etc. to not fall over).
>
> Now, applying SLR to runnable_load_avg would cut this down to ~2^21
> tasks running at once or 2^20 with cgroups, which is technically
> allowed, though it seems utterly implausible (especially since this
> would have to all be on one cpu). If SLR was increased as peterz asked
> about, this could be an issue though.
>
> All that said, using SLR on load_sum/load_avg as opposed to cfs_rq
> runnable_load_avg would be fine, as they're limited to only one
> task/cgroup's weight. Having it SLRed and cfs_rq not would be a
> little odd, but not impossible.


> > If NICE_0_LOAD is nice-0's load, and SCHED_LOAD_SHIFT is how to get
> > nice-0's load, I don't understand why you want to separate them.
>
> SCHED_LOAD_SHIFT is not how to get nice-0's load; it just happens to
> have the same value as NICE_0_SHIFT. (I think, anyway; SCHED_LOAD_* is
> used in precisely one place other than the newish util_avg, and as I
> mentioned it's not remotely clear what compute_imbalance is doing there.)

Yes, it is not clear to me either.

With the above proposal to get rid of scale_load_down() for load_avg, I think we
can now remove SCHED_LOAD_* altogether, rename scale_load() to user_to_kernel_load(),
and rename scale_load_down() to kernel_to_user_load().
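
For kernel/sched/sched.h that could look something like this (a
hypothetical sketch; NICE_0_LOAD_RESOLUTION is a name made up here to
replace SCHED_LOAD_RESOLUTION, it is not existing code):

/*
 * Sketch only: SCHED_LOAD_* gone, scale_load() renamed to
 * user_to_kernel_load(), scale_load_down() to kernel_to_user_load().
 */
#if BITS_PER_LONG > 32
# define NICE_0_LOAD_RESOLUTION	10
# define user_to_kernel_load(w)	((w) << NICE_0_LOAD_RESOLUTION)
# define kernel_to_user_load(w)	((w) >> NICE_0_LOAD_RESOLUTION)
#else
# define NICE_0_LOAD_RESOLUTION	0
# define user_to_kernel_load(w)	(w)
# define kernel_to_user_load(w)	(w)
#endif

#define NICE_0_SHIFT	(10 + NICE_0_LOAD_RESOLUTION)
#define NICE_0_LOAD	(1L << NICE_0_SHIFT)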

Hmm?