Re: [REGRESSION 2.6.30][PATCH 1/1] sched: defer idle accounting tillafter load update period

From: Thomas Gleixner
Date: Thu Apr 01 2010 - 15:39:14 EST


On Thu, 1 Apr 2010, Andrew Morton wrote:
> On Mon, 29 Mar 2010 09:41:12 -0400
> Chase Douglas <chase.douglas@xxxxxxxxxxxxx> wrote:
>
> > There's a period of 10 ticks where calc_load_tasks is updated by all the
> > cpus for the load avg. Usually all the cpus do this during the first
> > tick. If any cpus go idle, calc_load_tasks is decremented accordingly.
> > However, if they wake up calc_load_tasks is not incremented. Thus, if
> > cpus go idle during the 10 tick period, calc_load_tasks may be
> > decremented to a non-representative value. This issue can lead to
> > systems having a load avg of exactly 0, even though the real load avg
> > could theoretically be up to NR_CPUS.
> >
> > This change defers calc_load_tasks accounting after each cpu updates the
> > count until after the 10 tick period.
> >
> > BugLink: http://bugs.launchpad.net/bugs/513848
> >
>
> There was useful information in the [patch 0/1] email, such as the
> offending commit ID. If possible, it's best to avoid the [patch 0/n]
> thing altogether - that information either has to be moved into the
> [patch 1/n] changelog by someone (ie: me), or it just gets ommitted and
> lost.
>
>
> > ---
> > kernel/sched.c | 16 ++++++++++++++--
> > 1 files changed, 14 insertions(+), 2 deletions(-)
> >
> > diff --git a/kernel/sched.c b/kernel/sched.c
> > index 9ab3cd7..c0aedac 100644
> > --- a/kernel/sched.c
> > +++ b/kernel/sched.c
> > @@ -3064,7 +3064,8 @@ void calc_global_load(void)
> > */
> > static void calc_load_account_active(struct rq *this_rq)
> > {
> > - long nr_active, delta;
> > + static atomic_long_t deferred;
> > + long nr_active, delta, deferred_delta;
> >
> > nr_active = this_rq->nr_running;
> > nr_active += (long) this_rq->nr_uninterruptible;
> > @@ -3072,6 +3073,17 @@ static void calc_load_account_active(struct rq *this_rq)
> > if (nr_active != this_rq->calc_load_active) {
> > delta = nr_active - this_rq->calc_load_active;
> > this_rq->calc_load_active = nr_active;
> > +
> > + /* Need to defer idle accounting during load update period: */
> > + if (unlikely(time_before(jiffies, this_rq->calc_load_update) &&
> > + time_after_eq(jiffies, calc_load_update))) {
> > + atomic_long_add(delta, &deferred);
> > + return;
> > + }
>
> That seems a sensible way to avoid the gross-looking "10 ticks" thing.
>
> What was the reason for "update the avenrun load estimates 10 ticks
> after the CPUs have updated calc_load_tasks"? Can we do something
> smarter there to fix this?

The reason was to make sure that all cpus have updated their stuff
halfways before we start to calculate and we spread out stuff among
cpus to avoid contention on those global atomic variables.

Yes, we can do something smarter. First thing is to fix the
nr_uninterruptible accounting which can become negative. Peter looked
into that already and we really need to fix this first before doing
anything smart about that load avg stuff.

Thanks,

tglx

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/