Re: 2.6.21-rc1: known regressions (v2) (part 2)

From: Con Kolivas
Date: Tue Feb 27 2007 - 18:24:20 EST


Apologies for the resend, lkml address got mangled...

On Tuesday 27 February 2007 19:54, Mike Galbraith wrote:
> On Tue, 2007-02-27 at 09:33 +0100, Ingo Molnar wrote:
> > * Michal Piotrowski <michal.k.k.piotrowski@xxxxxxxxx> wrote:
> > > Thomas Gleixner wrote:
> > > > Adrian,
> > > >
> > > > On Mon, 2007-02-26 at 23:05 +0100, Adrian Bunk wrote:
> > > >> Subject    : kernel BUG at kernel/time/tick-sched.c:168
> > > >>              (CONFIG_NO_HZ)
> > > >> References : http://lkml.org/lkml/2007/2/16/346
> > > >> Submitter  : Michal Piotrowski <michal.k.k.piotrowski@xxxxxxxxx>
> > > I can confirm that the bug is fixed (over 20 hours of testing should
> > > be enough).
> >
> > thanks a lot! I think this thing was a long-term performance/latency
> > regression in HT scheduling as well.

Ingo, I'm going to have to partially disagree with you on this.

This has only become a problem because of what happens with dynticks now when
rq->curr == rq->idle. Prior to this, that particular SMT code only led to
relative delays in scheduling for lower priority tasks. Whether or not that
task is ksoftirqd should not matter, because it is not as though they are
starved indefinitely; it is only that nice 19 tasks are relatively delayed,
which is by definition implied by using nice as a scheduler hint, wouldn't
you say? I know it has been discussed many times before whether 'nice' means
less cpu and/or more latency, but in our current implementation nice means
both less cpu and more latency. So to me, the kernels without dynticks do not
have a regression. This seems to be a problem only in the setting of the new
dynticks code IMHO. That's not to say it isn't a bug! Nor am I saying that
dynticks is a problem! Please don't misinterpret that.

The second issue is that this is a problem because of the fuzzy definition of
what idle is for a runqueue in the setting of this SMT code. Normally,
rq->curr == rq->idle means the runqueue is idle, but not with this code, since
rq->nr_running can still be non-zero on that runqueue. What dynticks in this
implementation is doing is trying to idle a hyperthread sibling on a cpu
whose logical partner is busy. I did not find that this added any power
saving on my earlier dynticks implementation, and found it easier to keep
that sibling ticking at the same rate as its partner. Of course you may have
found something different, and I definitely agree with what you are likely to
say in response to this: we shouldn't have to special case logical siblings
as having a different definition of idle than any other smp case. Ultimately,
that leaves us with your simple patch as a reasonable solution for the
dynticks case, even though it does change the behaviour dramatically for a
more loaded cpu. I don't see this code as presenting a problem without or
prior to the dynticks implementation. You being the scheduler maintainer
means you get to choose the best way to tackle this problem.

> Agreed.
>
> I was recently looking at that spot because I found that niced tasks
> were taking latency hits, and disabled it, which helped a bunch.

Ok... as I said above to Ingo, nice means more latency too, and there is no
doubt that if we disable nice as a working feature then the niced tasks will
have less latency. Of course, this ends up meaning that un-niced tasks no
longer receive their better cpu performance. You're basically saying that
you prefer nice not to work in the setting of HT.

> I also
> can't understand why it would be OK to interleave a normal task with an
> RT task sometimes, but not others.. that's meaningless to the RT task.

Clearly there would be a reason that code is there... The whole problem with
HT is that as soon as you load up a sibling, you slow down the logical
sibling, hence why this code is there in the first place. Since I know you're
one to test things for yourself, I will put it to you this way:

Boot into UP. Time how long it takes to do a million of these in a real time
task:
asm volatile("" : : : "memory");

Then start up a fully cpu bound SCHED_NORMAL task such as "yes > /dev/null"
and time that again. Obviously the former, being a realtime task, will take
the same amount of time, and the SCHED_NORMAL task will be starved until the
realtime task finishes.
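
If it helps, the timing loop could be sketched roughly like this. The helper
names try_make_realtime() and time_barrier_loop() are made up for this mail,
and SCHED_FIFO and CLOCK_MONOTONIC are just my choices for "a real time task"
and "time how long"; switching to SCHED_FIFO normally needs root or
CAP_SYS_NICE, so check the return value.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <sched.h>
#include <stdint.h>
#include <time.h>

/*
 * Hypothetical helper: try to switch the calling process to SCHED_FIFO.
 * Returns 0 on success, -1 if the kernel refuses (e.g. no privileges).
 */
static inline int try_make_realtime(int priority)
{
    struct sched_param sp = { .sched_priority = priority };

    return sched_setscheduler(0, SCHED_FIFO, &sp);
}

/*
 * Hypothetical helper: execute `iterations` compiler barriers and return
 * the elapsed wall-clock time in nanoseconds.
 */
static inline int64_t time_barrier_loop(long iterations)
{
    struct timespec start, end;

    assert(iterations > 0);
    clock_gettime(CLOCK_MONOTONIC, &start);
    for (long i = 0; i < iterations; i++)
        __asm__ volatile("" : : : "memory");   /* same barrier as above */
    clock_gettime(CLOCK_MONOTONIC, &end);

    return (int64_t)(end.tv_sec - start.tv_sec) * 1000000000LL
         + (int64_t)(end.tv_nsec - start.tv_nsec);
}
```

Call try_make_realtime() once from main(), then print the result of
time_barrier_loop(1000000) with and without the cpu bound task running, and
compare the numbers. A larger iteration count gives less noisy figures.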

Now try the same experiment with hyperthreading enabled and an ordinary SMP
kernel. You'll find the realtime task runs at only ~60% performance. That's a
serious performance hit for a realtime task considering that all it is
competing with is a SCHED_NORMAL task. The SMT code that you seem to dislike
fixes this problem. The reason for interleaving is that there are a few
cycles to be gained by using the second logical core for a separate
SCHED_NORMAL task, and you don't want to disable access to the second logical
core entirely for the duration the realtime task is running. Since there is
no simple relationship between SCHED_NORMAL timeslices and realtime
timeslices, we have to use some form of interleaving based on the expected
extra cycles, and HZ is the obvious choice.
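
As a rough illustration of tick-based interleaving, here is a toy model. It
is a sketch under my own assumptions and is not the code in the scheduler:
toy_sibling_may_run(), window_ticks and gain_pct are made-up names, with
gain_pct standing in for the expected extra cycles expressed as a percentage.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Toy model, NOT the kernel's implementation: within each window of
 * `window_ticks` timer ticks, let the sibling run a SCHED_NORMAL task only
 * during the first `gain_pct` percent of the window, and keep it idle for
 * the remainder so the realtime task has the physical core to itself.
 */
static inline bool toy_sibling_may_run(unsigned long tick,
                                       unsigned long window_ticks,
                                       unsigned int gain_pct)
{
    assert(window_ticks > 0 && gain_pct <= 100);
    return (tick % window_ticks) < (window_ticks * gain_pct / 100);
}
```

With window_ticks = 100 and gain_pct = 25, the SCHED_NORMAL task would be
allowed on the sibling for ticks 0 to 24 of every 100 and held off for the
other 75, so the realtime task only shares the core for a quarter of the
time.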

> IMHO, SMT scheduling should be a buyer beware thing. Maximizing your
> core utilization comes at a price, but so does disabling it, so I think
> letting the user decide what he wants is the right thing to do.

To me this is like arguing that we should not implement 'nice' within the cpu
scheduler at all and only allow nice to work on the few architectures that
support hardware priorities in the cpu (like power5). Now there is no doubt
that if we do away with nice entirely everywhere in the scheduler we'll gain
some throughput. However, nice is a basic unix/linux function, and if
hardware comes along that breaks it, we should be working to make sure that
it keeps working in software. That is why smt nice and smp nice were
implemented. Of course it is our duty to ensure we do that at minimal
overhead at all times. That's a different argument to what you are debating
here. The throughput should not be adversely affected by this SMT priority
code, because although the nice 19 task gets less throughput, the higher
priority task gets more as a result, which is essentially what nice is meant
to do.

--
-ck