Re: [PATCH 3/3] sched: Disable affine wakeups by default

From: Mike Galbraith
Date: Sun Oct 25 2009 - 18:05:05 EST


On Sun, 2009-10-25 at 12:33 -0700, Arjan van de Ven wrote:
> On Sun, 25 Oct 2009 18:38:09 +0100
> Mike Galbraith <efault@xxxxxx> wrote:
> > > > Even if you're sharing a cache, there are reasons to wake
> > > > affine. If the wakee can preempt the waker while it's still
> > > > eligible to run, wakee not only eats toasty warm data, it can
> > > > hand the cpu back to the waker so it can make more and repeat
> > > > this procedure for a while without someone else getting in
> > > > between, and trashing cache.
> > >
> > > and on the flipside, and this is the workload I'm looking at,
> > > this is halving your performance roughly due to one core being
> > > totally busy while the other one is idle.
> >
> > Yeah, the "one pgsql+oltp pair" in the numbers I posted shows that
> > problem really well. If you can hit an idle shared cache at low load,
> > go for it every time.
>
> sadly the current code does not do this ;(
> my patch might be too big an axe for it, but it does solve this part ;)

The below fixed up the pgsql+oltp low end, but has a negative effect on
the high end. Must be some stuttering going on.

> I'll keep digging to see if we can do a more micro-incursion.
>
> > Hm. That looks like a bug, but after any task has scheduled a few
> > times, if it looks like a synchronous task, it'll glue itself to its
> > waker's runqueue regardless. Initial wakeup may disperse, but it will
> > come back if it's not overlapping.
>
> the problem is the "synchronous to WHAT" question.
> It may be synchronous to the disk for example; in the testcase I'm
> looking at, we get "send message to X. do some more code. hit a page
> cache miss and do IO" quite a bit.

Hm. Yes, disk could be problematic. That's exactly the pattern the
affinity code looks for: you wake somebody, then almost immediately go
to sleep. OTOH, even housekeeper threads make warm data.

> > > The numbers you posted are for a database, and only measure
> > > throughput. There's more to the world than just databases /
> > > throughput-only computing, and I'm trying to find low impact ways
> > > to reduce the latency aspect of things. One obvious candidate is
> > > hyperthreading/SMT where it IS basically free to switch to a
> > > sibling, so wake-affine does not really make sense there.
> >
> > It's also almost free on my Q6600 if we aimed for idle shared cache.
>
> yeah multicore with shared cache falls for me in the same bucket.

Anyone with a non-shared-cache multicore would be most unhappy with my
little test hack.

> > I agree fully that affinity decisions could be more perfect than they
> > are. Getting it wrong is very expensive either way.
>
> Looks like we agree on a key principle:
> If there is a free cpu "close enough" (SMT or MC basically), the
> wakee should just run on that.
>
> we may not agree on what to do if there's no completely free logical
> cpu, but a much lighter loaded one instead.
> but first we need to let code speak ;)

mysql+oltp
clients         1         2         4         8        16        32        64       128       256
tip      10013.90  18526.84  34900.38  34420.14  33069.83  32083.40  30578.30  28010.71  25605.47  3x avg
tip+     10071.16  18498.33  34697.17  34275.20  32761.96  31657.10  30223.70  27363.50  24698.71
          9971.57  18290.17  34632.46  34204.59  32588.94  31513.19  30081.51  27504.66  24832.24
          9884.04  18502.26  34650.08  34250.13  32707.81  31566.86  29954.19  27417.09  24811.75


pgsql+oltp
clients         1         2         4         8        16        32        64       128       256
tip      13907.85  27135.87  52951.98  52514.04  51742.52  50705.43  49947.97  48374.19  46227.94  3x avg
tip+     15163.56  28882.70  52374.32  52469.79  51739.79  50602.02  49827.18  48029.84  46191.90
         15258.65  28778.77  52716.46  52405.32  51434.21  50440.66  49718.89  48082.22  46124.56
         15278.02  28178.55  52815.82  52609.98  51729.17  50652.10  49800.19  48126.95  46286.58
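
In words, the hack boils down to the rule in the toy model below
(made-up data and names, not the actual kernel code; the real change is
the diff that follows): within a cache-sharing (SMT/MC) wake-affine
domain, take any CPU whose runqueue is empty, and only fall back to the
old "prev_cpu lies in this domain" test otherwise.

/* Toy model of the placement rule, illustration only. */
#include <stdio.h>

#define NR_CPUS 4

static int nr_running[NR_CPUS] = { 2, 0, 1, 3 };	/* made-up load */

static int pick_cpu(int waking_cpu, int prev_cpu,
		    const int *span, int span_len, int shares_cache)
{
	int i;

	if (shares_cache) {
		for (i = 0; i < span_len; i++)
			if (!nr_running[span[i]])
				return span[i];	/* idle sibling/core wins */
	} else {
		for (i = 0; i < span_len; i++)
			if (span[i] == prev_cpu)
				return prev_cpu; /* old affine behaviour */
	}
	return waking_cpu;			/* nothing better, stay put */
}

int main(void)
{
	int mc_span[] = { 0, 1, 2, 3 };		/* one shared-cache domain */

	/* wakee lands on CPU 1, the idle one, not on prev_cpu 3 */
	printf("wakee -> CPU %d\n", pick_cpu(0, 3, mc_span, 4, 1));
	return 0;
}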


diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index 37087a7..fa534f0 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -1374,6 +1374,8 @@ static int select_task_rq_fair(struct task_struct *p, int sd_flag, int wake_flag
 
 	rcu_read_lock();
 	for_each_domain(cpu, tmp) {
+		int level = tmp->level;
+
 		/*
 		 * If power savings logic is enabled for a domain, see if we
 		 * are not overloaded, if so, don't balance wider.
@@ -1398,11 +1400,28 @@ static int select_task_rq_fair(struct task_struct *p, int sd_flag, int wake_flag
 			want_sd = 0;
 		}
 
+		/*
+		 * look for an idle shared cache before looking at last CPU.
+		 */
 		if (want_affine && (tmp->flags & SD_WAKE_AFFINE) &&
-		    cpumask_test_cpu(prev_cpu, sched_domain_span(tmp))) {
+		    (level == SD_LV_SIBLING || level == SD_LV_MC)) {
+			int i;
 
+			for_each_cpu(i, sched_domain_span(tmp)) {
+				if (!cpu_rq(i)->cfs.nr_running) {
+					affine_sd = tmp;
+					want_affine = 0;
+					cpu = i;
+				}
+			}
+		} else if (want_affine && (tmp->flags & SD_WAKE_AFFINE) &&
+			   cpumask_test_cpu(prev_cpu, sched_domain_span(tmp))) {
 			affine_sd = tmp;
 			want_affine = 0;
+
+			if ((level == SD_LV_SIBLING || level == SD_LV_MC) &&
+			    !cpu_rq(prev_cpu)->cfs.nr_running)
+				cpu = prev_cpu;
 		}
 
 		if (!want_sd && !want_affine)

