Re: lmbench ctxsw regression with CFS

From: Nick Piggin
Date: Mon Aug 13 2007 - 23:23:24 EST


On Mon, Aug 13, 2007 at 08:00:38PM -0700, Andrew Morton wrote:
> On Mon, 13 Aug 2007 14:30:31 +0200 Jens Axboe <jens.axboe@xxxxxxxxxx> wrote:
>
> > On Mon, Aug 06 2007, Nick Piggin wrote:
> > > > > What CPU did you get these numbers on? Do the indirect calls hurt much
> > > > > on those without an indirect predictor? (I'll try running some tests).
> > > >
> > > > it was on an older Athlon64 X2. I never saw indirect calls really
> > > > hurting on modern x86 CPUs - dont both CPU makers optimize them pretty
> > > > efficiently? (as long as the target function is always the same - which
> > > > it is here.)
> > >
> > > I think a lot of CPUs do. I think ia64 does not. It predicts
> > > based on the contents of a branch target register which has to
> > > be loaded I presume before instructoin fetch reaches the branch.
> > > I don't know if this would hurt or not.
> >
> > Testing on ia64 showed that the indirect calls in the io scheduler hurt
> > quite a bit, so I'd be surprised if the impact here wasn't an issue
> > there.
>
> With what workload? lmbench ctxsw? Who cares?
>
> Look, if you're doing 100,000 context switches per second per then *that*
> is your problem. You suck, and making context switches a bit faster
> doesn't stop you from sucking. And ten microseconds is a very long time
> indeed.
>
> Put it this way: if a 50% slowdown in context switch times yields a 5%
> improvement in, say, balancing decisions then it's probably a net win.
>
> Guys, repeat after me: "context switch is not a fast path". Take that
> benchmark and set fire to it.

It definitely can be. For workloads that are inherently asynchronous, high
speed networking or disk IO (ie. with event generation significantly outside
the control of the kernel or app), then it can be. Sure, you may just be
switching between the main working thread and idle thread, but in that case a
slowdown in the scheduler will be _more_ pronounced because you don't have to
do as much work to actually switch contexts.

If there was a performance tradeoff involved, then we could think about it,
and you might be right. But this is just a case of "write code to do direct
calls or do indirect calls".

Ken Chen's last ia64 database benchmark I could find says schedule takes
6.5% of the clock cycles, the second highest consumer. Considering the
lengths he was going to shave cycles off other paths, I'd call schedule()
a fastpath. Would be really interesting to rerun that benchmark with CFS.
Is anyone at Intel still doing those tests?

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/