Re: 20% performance drop on PostgreSQL 9.2 from kernel 3.5.3 to3.6-rc5 on AMD chipsets - bisected

From: Linus Torvalds
Date: Sun Sep 16 2012 - 15:58:09 EST


On Sat, Sep 15, 2012 at 9:35 PM, Mike Galbraith <efault@xxxxxx> wrote:
>
> Oh, while I'm thinking about it, there's another scenario that could
> cause the select_idle_sibling() change to affect pgbench on largeish
> packages, but it boils down to preemption odds as well.

So here's a possible suggestion..

Let's assume that the scheduler code to find the next idle CPU on the
package is actually a good idea, and we shouldn't mess with the idea.

But at the same time, it's clearly an *expensive* idea, which is why
you introduced the "only test a single CPU buddy" approach instead.
But that didn't work, and you can come up with multiple reasons why it
wouldn't work. Plus, quite fundamentally, it's rather understandable
that "try to find an idle CPU on the same package" really would be a
good idea, right?

So instead of limiting the "try to find an idle CPU on the same
package" to "pick *one* idle CPU on the same package to try", how
about just trying to make the whole "find another idle CPU" much
cheaper, and much more scalable that way?

Quite frankly, the loop in select_idle_sibling() is insanely expensive
for what it really wants to do. All it really wants to do is:
- iterate over idle processors on this package
- make sure that idle processor is in the cpu's allowed.

Right?

But the whole use of "cpumask_intersects()" etc is quite expensive,
and there's that crazy double loop to do the above. So no wonder that
it's expensive and causes scalability problems. That would be
*especially* true if nr_cpumask_bits is big, and we have
CONFIG_CPUMASK_OFFSTACK defined.

So I would suggest:
(a) revert the original commit (already done in my tree)
(b) look at just making the loop *much* cheaper.

For example, all those "cpumask_intersects()" and "cpumask_first()"
things are *really* expensive, and expand to tons of code especially
for the OFFSTACK case (and aren't exactly free even for the smaller
case). And it really is all stupidly and badly done. I bet we can make
that code faster without really changing the end result at all, just
changing the algorithm.

For example, what if we got rid of all the crazy "sd groups" crap at
run-time, and just built a single linked circular list of CPU's on the
same package?

Then we'd replace that crazy-expensive double loop over sd->groups and
for_each_cpu() crap (not to mention cpumask_first_and() etc) with just
a simple loop over that (and pick the *next* idle cpu, instead of
doing that crazy "pick first one in a bitmask after and'ing").

In fact, looking at select_idle_sibling(), I just want to puke. The
thing is crazy.

Why the hell isn't the *very* first thing that function does just a simple

if (idle_cpu(target))
return target;

instead it does totally f*cking insane things, and checks whether
"target == cpu && idle_cpu(cpu)".

The code is shit. Just fix the shit, instead of trying to come up with
some totally different model. Ok? I bet just fixing it to not have
insane double loops would already get 90% of the speedup that Mike's
original patch did, but without the downsides of having to pick just a
single idle-buddy.

We might also possibly add a "look at SMT buddy first" case, because
Mike is probably right that bouncing all over the package isn't
necessarily a good idea unless we really need to. But that would be a
different thing.

Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/