[patch] Re: PostgreSQL pgbench performance regression in 2.6.23+

From: Mike Galbraith
Date: Tue May 27 2008 - 01:59:25 EST



On Mon, 2008-05-26 at 20:28 -0400, Greg Smith wrote:

> I did again get useful results here with the stock 2.6.26.git kernel and
> default parameters using Peter's small patch to adjust se.waker.
>
> What I found most interesting was how the results changed when I set
> /proc/sys/kernel/sched_features = 0, without doing anything with batch
> mode. The default for that is 1101111111=895. What I then did was run
> through setting each of those bits off one by one to see which feature(s)
> were getting in the way here. The two that mattered a lot were 895-32=863
> (no SCHED_FEAT_SYNC_WAKEUPS) and 895-2=893 (no
> SCHED_FEAT_WAKEUP_PREEMPT). Combining those two but keeping the rest of
> the features on (895-32-2=861) actually gave the best result I've ever
> seen here, better than with all the features disabled. Tossing out all
> the tests I did that didn't show anything useful, here's my table of the
> interesting results.
>
> Clients   .22.19   .26.git    waker      f=0    f=893    f=863    f=861
>       1     7660     11043    11041     9214    11204     9232     9433
>       2    17798     11452    16306    16916    11165    16686    16097
>       3    29612     13231    18476    24202    11348    26210    26906
>       4    25584     13053    17942    26639    11331    25094    25679
>       6    25295     12263    18472    28918    11761    30525    33297
>       8    24344     11748    19109    32730    12190    31775    35912
>      10    23963     11612    19537    31688    12331    29644    36215
>      15    23026     11414    19518    33209    13050    28651    36452
>      20    22549     11332    19029    32583    13544    25776    35707
>      30    22074     10743    18884    32447    14191    21772    33501
>      40    21495     10406    18609    31704    11017    20600    32743
>      50    20051     10534    17478    29483    14683    19949    31047
>      60    18690      9816    17467    28614    14817    18681    29576
>
> Note that compared to earlier test runs, I replaced the 5 client case with
> a 60 client one to get more data on the top end. I also wouldn't pay too
> much attention to the single client case; that one really bounces around a
> lot on most of the kernel revs, even with me doing 5 runs and using the
> median.
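
Decoding that bit arithmetic for anyone following along, a minimal C
sketch; the bit values below are taken from your mail rather than from
sched.c, so double check them against your tree:

#include <stdio.h>

#define DEFAULT_FEATURES	895	/* 1101111111 binary, the default */
#define FEAT_WAKEUP_PREEMPT	2	/* SCHED_FEAT_WAKEUP_PREEMPT */
#define FEAT_SYNC_WAKEUPS	32	/* SCHED_FEAT_SYNC_WAKEUPS */

int
main (void)
{
	unsigned long feat = DEFAULT_FEATURES;

	feat &= ~FEAT_SYNC_WAKEUPS;	/* 895 - 32 = 863 */
	feat &= ~FEAT_WAKEUP_PREEMPT;	/* 863 - 2  = 861 */

	/* 861 is what you'd echo into /proc/sys/kernel/sched_features */
	printf ("%lu\n", feat);
	return 0;
}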

Care to give the below a whirl? It fixes the over-enthusiastic affinity
bug in a less restrictive way. It doesn't attempt to address the needs
of any particular load though; that needs more thought (tricky issue).

With default features, I get the below.

2.6.26-smp x86_64
 1  10121.600913
 2  14360.229517
 3  17048.770371
 4  18748.777814
 5  22086.493358
 6  24913.416187
 8  27976.026783
10  29346.503261
15  29157.239431
20  28392.257204
30  26590.199787
40  24422.481578
50  23305.981434

(I can get a bit more by disabling HR_TICK, along with a dinky patchlet
to reduce overhead when it's disabled. Bottom line is that the bug is
fixed though; maximizing performance is a separate issue imho.)

Prevent short-running wakers of short-running threads from overloading a single
cpu via wakeup affinity, and wire up the disconnected AFFINE_WAKEUPS debug
option.

Signed-off-by: Mike Galbraith <efault@xxxxxx>

kernel/sched_fair.c | 25 ++++++++++++++-----------
1 files changed, 14 insertions(+), 11 deletions(-)

Index: linux-2.6.26.git/kernel/sched_fair.c
===================================================================
--- linux-2.6.26.git.orig/kernel/sched_fair.c
+++ linux-2.6.26.git/kernel/sched_fair.c
@@ -1057,16 +1057,27 @@ wake_affine(struct rq *rq, struct sched_
 	struct task_struct *curr = this_rq->curr;
 	unsigned long tl = this_load;
 	unsigned long tl_per_task;
+	int bad_imbalance;
 
-	if (!(this_sd->flags & SD_WAKE_AFFINE))
+	if (!(this_sd->flags & SD_WAKE_AFFINE) || !sched_feat(AFFINE_WAKEUPS))
 		return 0;
 
 	/*
+	 * If sync wakeup then subtract the (maximum possible)
+	 * effect of the currently running task from the load
+	 * of the current CPU:
+	 */
+	if (sync && tl)
+		tl -= curr->se.load.weight;
+
+	bad_imbalance = 100*(tl + p->se.load.weight) > imbalance*load;
+
+	/*
 	 * If the currently running task will sleep within
 	 * a reasonable amount of time then attract this newly
 	 * woken task:
 	 */
-	if (sync && curr->sched_class == &fair_sched_class) {
+	if (sync && !bad_imbalance && curr->sched_class == &fair_sched_class) {
 		if (curr->se.avg_overlap < sysctl_sched_migration_cost &&
 		    p->se.avg_overlap < sysctl_sched_migration_cost)
 			return 1;
@@ -1075,16 +1086,8 @@ wake_affine(struct rq *rq, struct sched_
 	schedstat_inc(p, se.nr_wakeups_affine_attempts);
 	tl_per_task = cpu_avg_load_per_task(this_cpu);
 
-	/*
-	 * If sync wakeup then subtract the (maximum possible)
-	 * effect of the currently running task from the load
-	 * of the current CPU:
-	 */
-	if (sync)
-		tl -= current->se.load.weight;
-
 	if ((tl <= load && tl + target_load(prev_cpu, idx) <= tl_per_task) ||
-			100*(tl + p->se.load.weight) <= imbalance*load) {
+	    !bad_imbalance) {
 		/*
 		 * This domain has SD_WAKE_AFFINE and
 		 * p is cache cold in this domain, and
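
The guts of the change is that imbalance test, hoisted ahead of the sync
wakeup shortcut so an overloading affine wakeup is refused either way.
In isolation, with made-up numbers (the nice-0 weight of 1024 and ~25%
allowed imbalance below are illustrative assumptions, not values from
any run):

#include <stdio.h>

int
main (void)
{
	/* All numbers hypothetical, for illustration only. */
	unsigned long tl = 2048;	/* this_cpu load, sync waker already subtracted */
	unsigned long weight = 1024;	/* p->se.load.weight of a nice-0 task */
	unsigned long load = 1024;	/* prev_cpu load */
	unsigned int imbalance = 125;	/* ~25% imbalance allowed */

	/* the same test the patch computes up front: refuse to pull p
	 * if doing so would leave this_cpu loaded well beyond prev_cpu */
	int bad_imbalance = 100 * (tl + weight) > imbalance * load;

	printf ("bad_imbalance = %d\n", bad_imbalance);	/* prints 1 */
	return 0;
}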


> Some still open questions after this long investigation that I'd like to
> know the answers to are:
>
> 1) Why are my 2.6.26.git results so dramatically worse than the ones Mike
> posted? I'm not sure what was different about his test setup here. The
> 2.6.22 results are pretty similar, and the fully tuned ones as well, so
> the big difference on that column bugs me.

Because those were git _with_ Peter's patch. I was looking at the
difference between a non-broken 2.6.26 with and without minimizing
preemption, to match the way I tested 2.6.22.

> 2) Mike suggested a patch to 2.6.25 in this thread that backports the
> feature for disabling SCHED_FEAT_SYNC_WAKEUPS. Would it be reasonable to
> push that into 2.6.25.5? It's clearly quite useful for this load and
> therefore possibly others.

If the patch above flies, imho, that should be folded into the backport
to stable. The heart of the patch is a bugfix, so definitely stable
material. Whether to enable the debugging/tweaking knobs as well is a
different question. (I would do it, but I ain't the maintainer ;)

> 3) Peter's se.waker patch is a big step forward on this workload without
> any tuning, closing a significant amount of the gap between the default
> setup and what I get with the two troublesome features turned off
> altogether. What issues might there be with pushing that into the stock
> {2.6.25|2.6.26} kernel?

(It doesn't fully address the 1:N wakeup needs though; that needs much more thought.)

> 4) What known workloads are there that suffer if SCHED_FEAT_SYNC_WAKEUPS
> and SCHED_FEAT_WAKEUP_PREEMPT are disabled? I'd think that any attempt to
> tune those code paths would need my case for "works better when
> SYNC/PREEMPT wakeups disabled" as well as a case that works worse to
> balance modifications against.

I suspect very many, certainly any load where latency is of major
importance. Peak performance of the mysql+oltp benchmark, for one, is
definitely injured by your settings. For a really good demonstration of
how important wakeup preemption can be, try running the attached
testcase, which was distilled from a real world load and posted to lkml
a few years ago, with wakeup preemption disabled and both tasks nailed
to one cpu (e.g. echo 893 > /proc/sys/kernel/sched_features, plus
taskset -c 0).

> 5) Once (4) has identified some tests cases, what else might be done to
> make the default behavior better without killing the situations it's
> intended for?

See patch :)

-Mike
#include <stdlib.h>
#include <stdio.h>
#include <signal.h>
#include <unistd.h>

#include <sys/types.h>
#include <sys/wait.h>

/*
 * Wakeup preemption testcase: a child floods its parent with SIGUSR1
 * until the parent has received the requested number of signals.
 * Without wakeup preemption, and with both tasks nailed to one cpu,
 * runtime explodes.  Build/run sketch (file name is whatever you
 * saved it as):  gcc -O2 -o signals signals.c && taskset -c 0 ./signals
 */

volatile unsigned long loop = 10000000;

/* SIGUSR1 handler: count down one received signal */
void
handler (int n)
{
	if (loop > 0)
		--loop;
}

/* child: give the parent a second to get going, then flood it */
static int
child (void)
{
	pid_t ppid = getppid ();

	sleep (1);
	while (1)
		kill (ppid, SIGUSR1);
	return 0;
}

int
main (int argc, char **argv)
{
	pid_t child_pid;

	loop = argc > 1 ? strtoul (argv[1], NULL, 10) : 10000000;
	printf ("expecting to receive %lu signals\n", loop);

	if ((child_pid = fork ()) == 0)
		exit (child ());

	signal (SIGUSR1, handler);
	while (loop)
		sleep (1);

	kill (child_pid, SIGTERM);
	waitpid (child_pid, NULL, 0);
	return 0;
}