Re: PostgreSQL pgbench performance regression in 2.6.23+

From: Greg Smith
Date: Fri May 23 2008 - 03:17:58 EST


On Thu, 22 May 2008, Peter Zijlstra wrote:

> I picked the wake_affine() condition, because I think that is the
> biggest factor in this behaviour.

I tested out Peter's patch (the updated version against -rc3, with a typo fix from Mike, included below) and it's a big step in the right direction. Here are updated results from my benchmark script, adding 2.6.26-rc3 and that same kernel with this patch applied:

Clients   2.6.22   2.6.24   2.6.25     -rc3    patch
      1    11052    10526    10700    10193    10439
      2    16352    14447    10370     9817    13289
      3    15414    17784     9403     9428    13678
      4    14290    16832     8882     9533    13033
      5    14211    16356     8527     9558    12790
      6    13291    16763     9473     9367    12660
      8    12374    15343     9093     9159    12357
     10    11218    10732     9057     8711    11839
     15    11116     7460     7113     7620    11267
     20    11412     7171     7017     7707    10531
     30    11191     7049     6896     7195     9766
     40    11062     7001     6820     7079     9668
     50    11255     6915     6797     7202     9588

Exact versions I tested, since I think the specific point releases may start to matter now: 2.6.22.19, 2.6.24.3, and 2.6.25. I didn't save the 2.6.23 results, but I recall them being similar to 2.6.24.

On this dual-core system, without this patch there's an average 33% regression in -rc3 compared to 2.6.22. With it, that drops to 8%; some cases (around 10 clients) even improve a touch, though that's close enough to the margin of error here that I wouldn't conclude too much from it. The big jump in the high client count cases is the first improvement I've seen there since CFS was introduced. It still seems a bit odd to me that there's such a large regression in the 2-8 client cases compared with not only 2.6.22 but also 2.6.24, which owned this benchmark in that range.

With this feedback, any ideas on where to go next? There still seems to be some room for improvement left here.


diff --git a/include/linux/sched.h b/include/linux/sched.h
index 5395a61..e160f71 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -965,6 +965,8 @@ struct sched_entity {
 	u64			last_wakeup;
 	u64			avg_overlap;
 
+	struct sched_entity	*waker;
+
 #ifdef CONFIG_SCHEDSTATS
 	u64			wait_start;
 	u64			wait_max;
diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index e24ecd3..9db3cb4 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -1066,7 +1066,8 @@ wake_affine(struct rq *rq, struct sched_domain *this_sd, struct rq *this_rq,
 	 * a reasonable amount of time then attract this newly
 	 * woken task:
 	 */
-	if (sync && curr->sched_class == &fair_sched_class) {
+	if (sync && curr->sched_class == &fair_sched_class &&
+	    p->se.waker == curr->se.waker) {
 		if (curr->se.avg_overlap < sysctl_sched_migration_cost &&
 		    p->se.avg_overlap < sysctl_sched_migration_cost)
 			return 1;
@@ -1238,6 +1239,7 @@ static void check_preempt_wakeup(struct rq *rq, struct task_struct *p)
 	if (unlikely(se == pse))
 		return;
 
+	se->waker = pse;
 	cfs_rq_of(pse)->next = pse;
 
 	/*
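
For anyone else trying to follow what the patch actually changes, here's my (possibly oversimplified) reading of it, written out as a standalone C sketch rather than kernel code. struct toy_se, pull_to_waking_cpu() and the hard-coded 500000ns are just illustration names and numbers (that value is, as far as I know, the mainline default for sysctl_sched_migration_cost); only the two fields and the condition itself mirror the patch:

/*
 * Toy model of the affine-wakeup decision with this patch applied.
 * NOT kernel code: the struct and helper are made up, only the fields
 * (avg_overlap, waker) and the condition mirror the patch.
 */
#include <stdbool.h>

struct toy_se {
	/* roughly: average CPU time (ns) used between a wakeup and the next sleep */
	unsigned long long avg_overlap;
	/* per the patch, set in check_preempt_wakeup(): the entity whose
	 * wakeup was last preemption-checked against this task */
	struct toy_se *waker;
};

/* assumed here: the mainline default of sysctl_sched_migration_cost */
static const unsigned long long migration_cost_ns = 500000ULL;

/*
 * Should a newly woken task p be pulled onto the waking CPU?
 * Before the patch: yes, for a sync wakeup where both p and the task
 * currently running on that CPU tend to run only briefly (avg_overlap
 * below the migration cost).  After the patch: additionally require
 * both tasks to have the same recorded ->waker, which as I read it
 * makes the pull much more selective, so a single busy client process
 * is less able to drag every server backend onto its own CPU.
 */
bool pull_to_waking_cpu(const struct toy_se *p_se,
			const struct toy_se *curr_se, bool sync)
{
	if (!sync)
		return false;
	if (p_se->waker != curr_se->waker)	/* the new check */
		return false;
	return p_se->avg_overlap < migration_cost_ns &&
	       curr_se->avg_overlap < migration_cost_ns;
}

If I've misread the intent of the ->waker comparison there, corrections welcome.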

--
* Greg Smith gsmith@xxxxxxxxxxxxx http://www.gregsmith.com Baltimore, MD