Re: [patch] Re: PostgreSQL pgbench performance regression in 2.6.23+

From: Mike Galbraith
Date: Tue May 27 2008 - 04:21:33 EST



On Tue, 2008-05-27 at 07:59 +0200, Mike Galbraith wrote:
> On Mon, 2008-05-26 at 20:28 -0400, Greg Smith wrote:
>
> > 2) Mike suggested a patch to 2.6.25 in this thread that backports the
> > feature for disabling SCHED_FEAT_SYNC_WAKEUPS. Would it be reasonable to
> > push that into 2.6.25.5? It's clearly quite useful for this load and
> > therefore possibly others.
>
> If the patch above flies, imho, that should be folded into the backport
> to stable. The heart of the patch is a bugfix, so definitely stable
> material. Whether to enable the debugging/tweaking knobs as well is a
> different question. (I would do it, but I ain't the maintainer;)

Hm, pgbench's extreme dislike of preemption, and the starvation testcase
I sent earlier having an absolute requirement of preemption, kinda argue
that some knobs and dials should be per task or task group (or, or... or
the scheduler should be all-knowing, all-seeing;)

clients   2.6.25.4-feat=45   2.6.25.4-feat=111   2.6.25.4-feat=47
      1       11385.471887        10292.721924        9551.157672
      2       16709.515434        15540.399522       16283.968970
      3       25456.658841        20187.320016       24562.735943
      4       24453.435157        24975.037450       23391.583053
      5       25504.302958        23102.131056       23671.860667
      6       27076.359200        24688.791507       25947.592071
      8       31758.200682        29462.639752       29700.144372
     10       32190.081142        30428.413809       27439.441838
     15       31175.074906        11097.668025       20344.284129
     20       30513.974332        10742.166624       19256.695409
     30       28307.399275        10233.708047       17535.423344
     40       26720.463867        10037.856773       16104.895695
     50       24899.945793         9907.624283       15768.746911
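
(Decoder ring: feat= is the sysctl_sched_features bitmask from the patch
below, flipped at runtime on a CONFIG_SCHED_DEBUG kernel -- presumably via
the sched_features sysctl used elsewhere in this thread. So feat=45 is
NEW_FAIR_SLEEPERS|START_DEBIT|HRTICK|AFFINE_WAKEUPS, i.e. sync wakeups and
wakeup preemption off; feat=111 is the patched default, everything except
DOUBLE_TICK; feat=47 is 45 plus WAKEUP_PREEMPT. Untested userspace toy,
nothing to do with the patch itself:)

/* Decode the feat= columns above; bits match the enum in the patch below. */
#include <stdio.h>

static const char *feat_names[] = {
	"NEW_FAIR_SLEEPERS",	/*  1 */
	"WAKEUP_PREEMPT",	/*  2 */
	"START_DEBIT",		/*  4 */
	"HRTICK",		/*  8 */
	"DOUBLE_TICK",		/* 16 */
	"AFFINE_WAKEUPS",	/* 32 */
	"SYNC_WAKEUPS",		/* 64 */
};

int main(void)
{
	unsigned int feats[] = { 45, 111, 47 };
	unsigned int i, bit;

	for (i = 0; i < sizeof(feats) / sizeof(feats[0]); i++) {
		printf("feat=%3u:", feats[i]);
		for (bit = 0; bit < 7; bit++)
			if (feats[i] & (1u << bit))
				printf(" %s", feat_names[bit]);
		printf("\n");
	}
	return 0;
}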

Anyway, if patchlet flies, and Ingo concurs, I'll submit the below.

Prevent short-running wakers of short-running threads from overloading a
single CPU via wakeup affinity, and provide affinity-related debug/tuning
options.

Signed-off-by: Mike Galbraith <efault@xxxxxx>

 kernel/sched.c      |    9 ++++++++-
 kernel/sched_fair.c |   25 ++++++++++++++-----------
 2 files changed, 22 insertions(+), 12 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index 1e4596c..d6d70a8 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -596,6 +596,8 @@ enum {
 	SCHED_FEAT_START_DEBIT		= 4,
 	SCHED_FEAT_HRTICK		= 8,
 	SCHED_FEAT_DOUBLE_TICK		= 16,
+	SCHED_FEAT_AFFINE_WAKEUPS	= 32,
+	SCHED_FEAT_SYNC_WAKEUPS		= 64,
 };
 
 const_debug unsigned int sysctl_sched_features =
@@ -603,7 +605,9 @@ const_debug unsigned int sysctl_sched_features =
 		SCHED_FEAT_WAKEUP_PREEMPT	* 1 |
 		SCHED_FEAT_START_DEBIT		* 1 |
 		SCHED_FEAT_HRTICK		* 1 |
-		SCHED_FEAT_DOUBLE_TICK		* 0;
+		SCHED_FEAT_DOUBLE_TICK		* 0 |
+		SCHED_FEAT_AFFINE_WAKEUPS	* 1 |
+		SCHED_FEAT_SYNC_WAKEUPS		* 1;
 
 #define sched_feat(x) (sysctl_sched_features & SCHED_FEAT_##x)
 
@@ -1902,6 +1906,9 @@ static int try_to_wake_up(struct task_struct *p, unsigned int state, int sync)
 	long old_state;
 	struct rq *rq;
 
+	if (!sched_feat(SYNC_WAKEUPS))
+		sync = 0;
+
 	smp_wmb();
 	rq = task_rq_lock(p, &flags);
 	old_state = p->state;
diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index 0080968..bf2defe 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -992,16 +992,27 @@ wake_affine(struct rq *rq, struct sched_domain *this_sd, struct rq *this_rq,
 	struct task_struct *curr = this_rq->curr;
 	unsigned long tl = this_load;
 	unsigned long tl_per_task;
+	int bad_imbalance;
 
-	if (!(this_sd->flags & SD_WAKE_AFFINE))
+	if (!(this_sd->flags & SD_WAKE_AFFINE) || !sched_feat(AFFINE_WAKEUPS))
 		return 0;
 
 	/*
+	 * If sync wakeup then subtract the (maximum possible)
+	 * effect of the currently running task from the load
+	 * of the current CPU:
+	 */
+	if (sync && tl)
+		tl -= curr->se.load.weight;
+
+	bad_imbalance = 100*(tl + p->se.load.weight) > imbalance*load;
+
+	/*
 	 * If the currently running task will sleep within
 	 * a reasonable amount of time then attract this newly
 	 * woken task:
 	 */
-	if (sync && curr->sched_class == &fair_sched_class) {
+	if (sync && !bad_imbalance && curr->sched_class == &fair_sched_class) {
 		if (curr->se.avg_overlap < sysctl_sched_migration_cost &&
 		    p->se.avg_overlap < sysctl_sched_migration_cost)
 			return 1;
@@ -1010,16 +1021,8 @@ wake_affine(struct rq *rq, struct sched_domain *this_sd, struct rq *this_rq,
 	schedstat_inc(p, se.nr_wakeups_affine_attempts);
 	tl_per_task = cpu_avg_load_per_task(this_cpu);
 
-	/*
-	 * If sync wakeup then subtract the (maximum possible)
-	 * effect of the currently running task from the load
-	 * of the current CPU:
-	 */
-	if (sync)
-		tl -= current->se.load.weight;
-
 	if ((tl <= load && tl + target_load(prev_cpu, idx) <= tl_per_task) ||
-			100*(tl + p->se.load.weight) <= imbalance*load) {
+	    !bad_imbalance) {
 		/*
 		 * This domain has SD_WAKE_AFFINE and
 		 * p is cache cold in this domain, and



--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/