Re: [CFS Bandwidth Control v4 2/7] sched: accumulate per-cfs_rq cpu usage

From: Peter Zijlstra
Date: Fri Feb 25 2011 - 07:31:40 EST


On Thu, 2011-02-24 at 19:33 -0800, Paul Turner wrote:

> >> #ifdef CONFIG_CFS_BANDWIDTH
> >> +static u64 tg_request_cfs_quota(struct task_group *tg)
> >> +{
> >> +        struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(tg);
> >> +        u64 delta = 0;
> >> +
> >> +        if (cfs_b->runtime > 0 || cfs_b->quota == RUNTIME_INF) {
> >> +                raw_spin_lock(&cfs_b->lock);
> >> +                /*
> >> +                 * it's possible a bandwidth update has changed the global
> >> +                 * pool.
> >> +                 */
> >> +                if (cfs_b->quota == RUNTIME_INF)
> >> +                        delta = sched_cfs_bandwidth_slice();
> >
> > Why do we bother at all when there's infinite time? Shouldn't the action
> > that sets it to infinite also set cfs_rq->quota_assigned to
> > RUNTIME_INF, in which case the below check will make it all go away?

> But more importantly, we don't want to decrement cfs_b->runtime by the
> value doled out in that case, since that would change it from the magic
> RUNTIME_INF. That's why the check exists.

Ah, quite so.

> >> +                else {
> >> +                        delta = min(cfs_b->runtime,
> >> +                                    sched_cfs_bandwidth_slice());
> >> +                        cfs_b->runtime -= delta;
> >> +                }
> >> +                raw_spin_unlock(&cfs_b->lock);
> >> +        }
> >> +        return delta;
> >> +}
> >
> > Also, shouldn't this all try and steal time from other cpus when the
> > global limit ran out? Suppose you have say 24 cpus, and you had a short
> > burst where all 24 cpus had some runtime, so you distribute 240ms. But
> > 23 of those cpus only ran for 0.5ms, leaving 23*9.5ms of unused time on
> > those 23 cpus, while your one active cpu will then throttle.
> >
>
> In practice this only affects the first period since that slightly
> stale bandwidth is then available on those other 23 cpus the next time
> a micro-burst occurs. In testing this has resulted in very stable
> performance and "smooth" perturbations that resemble hardcapping by
> affinity (for integer points).
>
> The notion of stealing could certainly be introduced; the juncture of
> reaching the zero point here would be a possible place to consider
> that (although we would need to do a steal that avoids the asymptotic
> convergence problem of RT).
>
> I think returning (most of) the bandwidth to the global pool on
> (voluntary) dequeue is a more scalable solution.
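
For reference, an untested sketch of what such a return path might look
like (the helper name is made up, and the generation tracking and lock
ordering issues raised below are deliberately left out):

static void return_cfs_rq_quota(struct cfs_rq *cfs_rq)
{
        struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(cfs_rq->tg);
        s64 remainder = (s64)(cfs_rq->quota_assigned - cfs_rq->quota_used);

        /* nothing to hand back in the infinite or exhausted cases */
        if (cfs_rq->quota_assigned == RUNTIME_INF || remainder <= 0)
                return;

        raw_spin_lock(&cfs_b->lock);
        /* a bandwidth update may have switched the pool to infinite */
        if (cfs_b->quota != RUNTIME_INF)
                cfs_b->runtime += remainder;
        raw_spin_unlock(&cfs_b->lock);

        /* the local cfs_rq no longer owns that remainder */
        cfs_rq->quota_used = cfs_rq->quota_assigned;
}

The hard part, as noted below, is knowing which period that remainder
actually belongs to before adding it back.
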
>
> > I would much rather see all the accounting tight first and optimize
> > later than start with leaky stuff and try and plug holes later.
> >
>
> The complexity this introduces is non-trivial since the idea of
> returning quota to the pool means you need to introduce the notion of
> when that quota came to life (otherwise you get leaks in the reverse
> direction!) -- as well as reversing some of the lock ordering.

Right, nasty that, RT doesn't suffer this because of the lack of
over-commit.
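
Roughly, the generation idea amounts to something like the fragment
below (untested; ->generation on struct cfs_bandwidth and
->quota_generation on struct cfs_rq are hypothetical fields, all
updated under cfs_b->lock):

        /* period timer, after the global pool is refilled: */
        cfs_b->generation++;

        /* tg_request_cfs_quota(), when a slice is handed out: */
        cfs_rq->quota_generation = cfs_b->generation;

        /* and a return is only honoured for the current period: */
        if (cfs_rq->quota_generation == cfs_b->generation)
                cfs_b->runtime += remainder;

Quota stamped with an old generation then simply dies with its period
instead of leaking back into the new one.
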

> While generations do this, they don't greatly increase the efficacy, and
> I think there is value in performing the detailed review we are doing
> now in isolation from that.
>
> It's also still consistent regarding leakage, in that in any N
> consecutive periods the maximum quota that can be received (by a user
> abusing this) is N+1 periods' worth. Does the complexity trade-off
> merit improving this bound at this point?

Well, something, yes; with N being potentially very large indeed these
days we need some feedback.

One idea would be to keep a cpumask in the bandwidth structure, set a
cpu's bit when that cpu claims bandwidth from the global pool, and
iterate over and clear the complete mask on each tick.

That also limits the scope on where to look for stealing time.
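
Completely untested illustration of that, with ->borrowed_cpus being a
made-up cpumask on struct cfs_bandwidth; tg_request_cfs_quota() would
mark the caller while it already holds cfs_b->lock:

        cpumask_set_cpu(smp_processor_id(), &cfs_b->borrowed_cpus);

and the tick would then only have to visit those cpus:

static void sweep_cfs_borrowers(struct cfs_bandwidth *cfs_b)
{
        int cpu;

        for_each_cpu(cpu, &cfs_b->borrowed_cpus) {
                /* reclaim or steal the unused local quota of 'cpu' here */
        }
        cpumask_clear(&cfs_b->borrowed_cpus);
}
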

> >> +static void account_cfs_rq_quota(struct cfs_rq *cfs_rq,
> >> +                                 unsigned long delta_exec)
> >> +{
> >> +        if (cfs_rq->quota_assigned == RUNTIME_INF)
> >> +                return;
> >> +
> >> +        cfs_rq->quota_used += delta_exec;
> >> +
> >> +        if (cfs_rq->quota_used < cfs_rq->quota_assigned)
> >> +                return;
> >> +
> >> +        cfs_rq->quota_assigned += tg_request_cfs_quota(cfs_rq->tg);
> >> +}
> >
> > So why isn't this hierarchical?
>
> It is naturally (since charging occurs within the existing hierarchical
> accounting).

D'oh yes.. somehow I totally missed that.
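
To spell that out for anyone skimming the thread: the charge is made
from update_curr(), which already runs for every cfs_rq the running
entity's hierarchy passes through, so each level is billed its own
delta_exec. A condensed paraphrase of the call site (not the literal
patch), with the unrelated bookkeeping elided:

static void update_curr(struct cfs_rq *cfs_rq)
{
        /* ... existing exec-time accounting elided ... */

        /*
         * update_curr() is invoked once per cfs_rq along the entity
         * hierarchy, so every level is charged the delta_exec it just
         * accumulated; no explicit walk is needed here.
         */
        account_cfs_rq_quota(cfs_rq, delta_exec);
}
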